Introducing OAST

This post has been a long time coming!  I’m happy to announce that our Oceans Across Space and Time (OAST) proposal to the NASA Astrobiology Program has been funded, launching a 5-year, 8+ institution effort to improve life detection strategies for key extraterrestrial environments.  We submitted this proposal in response to the NASA Astrobiology Institute (NAI) call, a key funding opportunity within the Astrobiology Program; however, OAST was ultimately funded under a new research coordination network titled Network for Life Detection (NfoLD).  Research coordination networks are a relatively new construct for NASA and provide a better framework for exchanging information between teams than the old “node”-based NAI model.  NfoLD will eventually encompass a number of NASA projects looking at various aspects of life detection and funded through a variety of different opportunities (Exobiology, PSTAR, Habitable Worlds, etc.).

OAST is led by my colleague Britney Schmidt, a planetary scientist at Georgia Tech (click here for the GT press release).  Joining us from SIO is Doug Bartlett, a deep-sea microbial ecologist.  Other institutions with a major role include Stanford, MIT, Louisiana State University, the University of Kansas, the University of Texas at Austin, and the Blue Marble Space Institute of Science.

The OAST science objectives are structured around the concept of contemporary, remnant, and relict ocean worlds, and are predicated on the idea that the distribution of biomass and biomarkers is controlled by different factors for each of these ocean “states”.  Understanding the distribution of biomass, and the persistence of biomarkers, will lead us to better strategies for detecting life on Mars, Europa, Enceladus, and other past or present ocean worlds.

Earth is unique among the planets in our solar system in having contemporary, remnant, and relict ocean environments.  This is convenient for us as it provides an opportunity to study these environments without all the cost and bother of traveling to other planets just to try unproven techniques.  For OAST, we’ve identified a suite of ocean environments that we think best represent these ocean states.  For contemporary ocean worlds (such as Europa and Enceladus) we’re studying deep hypersaline anoxic basins (DHABs – I might have hit the acronym limit for a single post…), which may be among the most bizarre microbial habitats on Earth.  These highly stratified ecosystems are energetically very limited and impose extreme environmental stress through high pressure and salinity.  These are very much like the conditions we’d expect on a place like Europa.  The video below from the BBC’s latest Blue Planet series provides some idea of what these environments are like.

For remnant ocean worlds we will study a number of hypersaline lake systems, such as were likely present on Mars as it transitioned from a wet to a dry world.  Unlike the contemporary Europan ocean, the remnant Martian ocean would have had lots of energy to support life, a condition shared by many saline lake environments on Earth.  This is illustrated by this photo of me holding a biomass-rich filter at the Great Salt Lake in Utah, way back in my undergraduate days.

Sunlight provides an abundance of energy to many hypersaline lake environments.  Despite the challenging conditions imposed by salt, these systems often have high rates of activity and abundant biomass.  Photo: Julian Sachs.

Relict ocean worlds are a smaller component of OAST.  This isn’t for lack of relevance – Mars is a relict ocean world after all – but you can’t do everything, even in five years and with an awesome team.  Nonetheless we will carry out work on what’s known as the Permian mid-continental deposits, to search for unambiguous biomarkers persisting from when the region was a remnant ocean.  OAST will be a big part of what we’re up to for the next five years, so stay tuned!

Posted in Uncategorized | 1 Comment

Weighted Gene Correlation Network Analysis (WGCNA) Applied to Microbial Communities

Weighted gene correlation network analysis (WGCNA) is a powerful network analysis tool that can be used to identify groups of highly correlated genes that co-occur across your samples. Genes are thus sorted into modules, and these modules can then be correlated with other traits (which must be continuous variables).

WGCNA was originally created to assess gene expression data in human patients, and the authors of the method and R package have a thorough tutorial with in-depth explanations (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/). More recently, the method has been applied to microbial communities (Duran-Pinedo et al., 2011; Aylward et al., 2015; Guidi et al., 2016; Wilson et al., 2018)–the following is a walk-through using microbial sequence abundances and environmental data from my 2018 work (https://www.ncbi.nlm.nih.gov/m/pubmed/29488352/).

Background: WGCNA finds how clusters of genes (or in our case abundances of operational taxonomic units–OTUs) correlate with traits (or in our case environmental variables or biochemical rates) using hierarchical clustering, novel applications of weighted adjacency functions and topological overlap measures, and a dynamic tree cutting method.

Very simply, each OTU is going to be represented by a node in a vast network and the adjacency (a score between 0 and 1) between each pair of nodes will be calculated. Many networks use hard thresholding (where a connection score [e.g. a Pearson correlation coefficient] between any two nodes is set to 1 if it is above a certain threshold and 0 if it is below it). This ignores the actual strength of the connection, so WGCNA constructs a weighted gene (or OTU) co-occurrence adjacency matrix in lieu of ‘hard’ thresholding. Because our original matrix has abundance data, the abundance of each OTU is also factored in.

For this method to work you also have to select a soft thresholding power (sft) to which each co-expression similarity is raised in order to make these scores “connection strengths”. I used a signed adjacency function:

  • Adjacency = (0.5*(1 + Pearson correlation))^sft

because it preserves the sign of the connection (whether nodes are positively or negatively correlated), which is the recommendation of the WGCNA authors.
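
As a quick toy illustration (this is not part of the pipeline; the real adjacency matrix is computed later with WGCNA’s adjacency() function), here is how the signed formula maps a few correlations onto the interval [0, 1] for a hypothetical soft thresholding power of 10:

pearson <- c(-1, -0.5, 0, 0.5, 1)
sft <- 10   # hypothetical soft thresholding power; the real one is chosen below
round((0.5 * (1 + pearson))^sft, 4)
# a correlation of -1 maps to 0, a correlation of 0 maps to ~0.001, and +1 maps to 1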

You pick your soft thresholding value by assessing scale-free topology fit. This is based on the idea that the probability that a node is connected with k other nodes decays as a power law:

  • p(k)~ k^(-γ)

This idea is linked to the growth of networks–new nodes are more likely to be attached to already established nodes. In general, scale-free networks display a high degree of tolerance against errors (Zhang & Horvath, 2005).
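
If you want to check the scale-free fit at a particular power yourself (rather than relying only on the pickSoftThreshold() call used later), WGCNA includes a scaleFreeFitIndex() function that works on a vector of node connectivities. A minimal sketch, assuming you have already computed an adjacency matrix called adj at some candidate power:

k <- colSums(adj) - 1   # connectivity of each node (drop the self-adjacency of 1)
scaleFreeFitIndex(k)    # returns the scale-free topology fit (R^2) and slope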

You then turn your adjacency matrix into a Topological Overlap Measure (TOM) to minimize the effects of noise and spurious associations. A topological overlap of two nodes also factors in all of their shared neighbors (their relative interrelatedness)–so you are basically taking a simple co-occurrence between two nodes and placing it in the framework of the entire network by factoring in all the other nodes each is connected to. For more information regarding adjacency matrices and TOMs please see Zhang & Horvath (2005) and Langfelder & Horvath (2007 & 2008).
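
For reference, the (unsigned) topological overlap defined in Zhang & Horvath (2005) for nodes i and j is:

  • TOM(i,j) = (Σ_u a(i,u)*a(u,j) + a(i,j)) / (min(k(i), k(j)) + 1 - a(i,j)), where k(i) = Σ_u a(i,u)

Here a(i,j) is the adjacency between nodes i and j and the sum runs over all other nodes u; the dissimilarity used for clustering later is simply 1 - TOM(i,j).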

Start: Obtain an OTU abundance matrix (MB.0.03.subsample.fn.txt) and environmental data (OxygenMatrixMonterey.csv).

The OTU abundance matrix simply has all the different OTUs that were observed in a bunch of different samples (denoted in the Group column; e.g. M1401, M1402, etc.). These OTUs represent 16S rRNA sequences that were assessed with the universal primers 515F-Y (5′-GTGYCAGCMGCCGCGGTAA) and 926R (5′-CCGYCAATTYMTTTRAGTTT) and were created using a 97% similarity cutoff. These populations were previously subsampled to the smallest library size and all processing took place in mothur (https://www.mothur.org/). See Wilson et al. (2018) for more details.

The environmental data matrix tells you a little bit more about the different Samples, like the Date of collection, which of two site Locations it was collected from, and the Depth or Zone of collection.  You also see a bunch of different environmental variables like several different Upwelling indices (for different stations and different time spans), community respiration rate (CR), Oxygen Concentration, and Temperature.  Again, see Wilson et al. (2018) for more details.

Code–Initial Stuff:

Read data in:

data<-read.table("MB.0.03.subsample.fn.txt",header=T,na.strings="NA")

For this particular file we have to get rid of the first three columns since the OTUs don’t actually start until the 4th column:

data1 = data[,-c(1:3)]

You should turn your raw abundance values into a relative abundance matrix and potentially transform it. I recommend a Hellinger transformation (the square root of the relative abundance)–this effectively gives low weight to variables with low counts and many zeros. If you wanted, you could instead use the logarithmic transformation of Anderson et al. (2006) here.

library("vegan", lib.loc="~/R/win-library/3.3")
HellingerData<-decostand(data1,method = "hellinger")

You have to limit the OTUs to the most frequent ones (ones that occur in multiple samples so that you can measure co-occurrence across samples). I just looked at my data file and looked for where zeros became extremely common. This was easy because mothur sorts OTUs according to abundance. If you would like a more objective way of selecting the OTUs, or if your OTUs are not sorted, then this code may help:

lessdata <- data1[,colSums(data1) > 0.05]

(though you will have to decide what cutoff works best for your data).

Code–Making your relative abundance matrix:

You have to reattach the Group Name column:

RelAbun1 = data.frame(data[2],HellingerData[1:750])

Write file (this step isn’t absolutely necessary, but you may want this file later at some point):

write.table(RelAbun1, file = "MontereyRelAbun.txt", sep="\t")

Code–Visualizing your data at the sample level:

Now load the WGCNA package:

library("WGCNA", lib.loc="~/R/win-library/3.3")

Bring data in:

OTUs<-read.table("MontereyRelAbun.txt",header=T,sep="\t")
dim(OTUs);
names(OTUs);

Turn the first column (sample name) into row names (so that only OTUs make up actual columns):

datExpr0 = as.data.frame((OTUs[,-c(1)]));
names(datExpr0) = names(OTUs)[-c(1)];
rownames(datExpr0) = OTUs$Group;

Check Data for excessive missingness:

gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK

You should get TRUE for this dataset given the parameters above. TRUE means that all OTUs have passed the cut. This means that when you limited your OTUs to the most common ones above you didn’t leave any in that had too many zeros. It is still possible that you were too choosy, though. If you get FALSE for your data, you will need to remove the offending OTUs and/or samples before proceeding; a minimal sketch follows.
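
In case you do get FALSE, here is a minimal sketch of the removal step, adapted from the standard WGCNA tutorial:

if (!gsg$allOK)
{
# Optionally, print the OTUs and samples that fail the filter:
if (sum(!gsg$goodGenes) > 0)
printFlush(paste("Removing OTUs:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples) > 0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending OTUs and samples:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}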

Cluster the samples to see if there are any obvious outliers:

sampleTree = hclust(dist(datExpr0), method = "average");

sizeGrWindow(12,9)

par(cex = 0.6);
par(mar = c(0,4,2,0))

plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)

The sample dendrogram doesn’t show any obvious outliers, so I didn’t remove any samples. If you do need to remove some samples, one option is to cut the dendrogram at a height that isolates the outliers and keep only the main cluster; a minimal sketch follows.
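
This sketch is adapted from the standard WGCNA tutorial and assumes a hypothetical cut height of 15; choose a value appropriate for your own dendrogram:

# draw a candidate cut height on the sample dendrogram (15 is a placeholder value)
abline(h = 15, col = "red");
clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
# cluster 1 contains the samples we want to keep
keepSamples = (clust == 1)
datExpr0 = datExpr0[keepSamples, ]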

Now read in trait (Environmental) data and match with expression samples:

traitData = read.csv("OxygenMatrixMonterey.csv");
dim(traitData)
names(traitData)

Form a data frame analogous to expression data (relative abundances of OTUs) that will hold the Environmental traits:

OTUSamples = rownames(datExpr0);
traitRows = match(OTUSamples, traitData$Sample);
datTraits = traitData[traitRows, -1];
rownames(datTraits) = traitData[traitRows, 1];
collectGarbage()

Outcome: Now your OTU expression (or abundance) data are stored in the variable datExpr0 and the corresponding environmental traits are in the variable datTraits. Now you can visualize how the environmental traits relate to clustered samples.

Re-cluster samples:

sampleTree2 = hclust(dist(datExpr0), method = "average")

Convert traits to a color representation: white means low, red means high, grey means missing entry:

traitColors = numbers2colors(datTraits[5:13], signed = FALSE);

Plot the sample dendrogram and the colors underneath:

plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits[5:13]),
main = "Sample dendrogram and trait heatmap")

Again: white means a low value, red means a high value, and gray means missing entry. This is just initial stuff… we haven’t looked at modules of OTUs that occur across samples yet.

Save:

save(datExpr0, datTraits, file = "Monterey-dataInput.RData")

Code–Network Analysis:

Allow multi-threading within WGCNA. This helps speed up certain calculations.
Any error here may be ignored but you may want to update WGCNA if you see one.

options(stringsAsFactors = FALSE);
enableWGCNAThreads()

Load the data saved in the first part:

lnames = load(file = "Monterey-dataInput.RData");

The variable lnames contains the names of loaded variables:

lnames

Note: You have a couple of options for how you create your weighted OTU co-expression network. I went with the step-by-step construction and module detection. Please see this document for information on the other methods (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/Simulated-05-NetworkConstruction.pdf).

Choose a set of soft-thresholding powers:

powers = c(c(1:10), seq(from = 11, to=30, by=1))

Call the network topology analysis function:
Note: I am using a signed network because it preserves the sign of the connection (whether nodes are positively or negatively correlated); this is the recommendation of the WGCNA authors.

sft = pickSoftThreshold(datExpr0, powerVector = powers, verbose = 5, networkType = "signed")

Output:

pickSoftThreshold: will use block size 750.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 750 of 750
   Power SFT.R.sq slope truncated.R.sq  mean.k. median.k. max.k.
1      1   0.0299  1.47          0.852 399.0000  400.0000 464.00
2      2   0.1300 -1.74          0.915 221.0000  221.0000 305.00
3      3   0.3480 -2.34          0.931 128.0000  125.0000 210.00
4      4   0.4640 -2.41          0.949  76.3000   73.1000 150.00
5      5   0.5990 -2.57          0.966  47.2000   44.0000 111.00
6      6   0.7010 -2.52          0.976  30.1000   27.1000  83.40
7      7   0.7660 -2.47          0.992  19.8000   17.2000  64.30
8      8   0.8130 -2.42          0.986  13.3000   11.0000  50.30
9      9   0.8390 -2.34          0.991   9.2200    7.1900  40.00
10    10   0.8610 -2.24          0.992   6.5200    4.8800  32.20
11    11   0.8670 -2.19          0.987   4.7000    3.3700  26.20
12    12   0.8550 -2.18          0.959   3.4600    2.3300  21.50

This is showing you the power (soft thresholding value), the r2 for the scale independence for each particular power (we shoot for an r2 higher than 0.8), the mean number of connections each node has at each power (mean.k), the median number of connections/node (median.k), and the maximum number of connections (max.k).

Plot the results:

sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;

Scale-free topology fit index (r2) as a function of the soft-thresholding power:

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");

This line corresponds to using an R^2 cut-off of h:

abline(h=0.8,col="red")

Mean connectivity as a function of the soft-thresholding power:

plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

I picked a soft thresholding value of 10 because its r2 is well above 0.8 (and near a local peak in the r2) while the mean connectivity remains reasonable (mean.k ≈ 6.5).

So now we just calculate the adjacencies, using the soft thresholding power of 10:

softPower = 10;
adjacency = adjacency(datExpr0, power = softPower, type = "signed");

Then we transform the adjacency matrix into a Topological Overlap Matrix (TOM) and calculate corresponding dissimilarity:

Remember: The TOM you calculate shows the topological similarity of nodes, factoring in the connection strength two nodes share with other “third party” nodes. This will minimize effects of noise and spurious associations:

TOM = TOMsimilarity(adjacency, TOMType = "signed");
dissTOM = 1-TOM

Create a dendrogram by calling the hierarchical clustering function on the TOM-based dissimilarity:

TaxaTree = hclust(as.dist(dissTOM), method = "average");

Plot the resulting clustering tree (dendrogram):

sizeGrWindow(12,9)
plot(TaxaTree, xlab="", sub="", main = "Taxa clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04);

This image is showing us the clustering of all 750 OTUs based on the TOM dissimilarity index.

Now you have to decide the optimal module size for your system and should play around with this value a little. I wanted relatively large modules so I set the minimum module size relatively high, at 30:

minModuleSize = 30;

Module identification using dynamic tree cut (there are a couple of different ways to figure out your modules, so you should explore what works best for you in the tutorials by the authors):

dynamicMods = cutreeDynamic(dendro = TaxaTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize);
table(dynamicMods)

Convert numeric labels into colors:

dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
dynamicColors
    black      blue     brown     green       red turquoise    yellow 
       49       135       113        71        64       216       102

You can see that there are a total of 7 modules (you should have seen that above too) and that each module is now named a different color. The numbers under the colors tell you how many OTUs were sorted into that module. Each OTU is in exactly 1 module, and you can see that if you add up all of the numbers from the various modules you get 750 (the number of OTUs that we limited our analysis to above).

Plot the dendrogram with module colors underneath:

sizeGrWindow(8,6)
plotDendroAndColors(TaxaTree, dynamicColors, "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Taxa dendrogram and module colors")

Now we will quantify the co-expression similarity of entire modules using eigengenes and cluster them based on their correlation:
Note: An eigengene is the 1st principal component of a module expression matrix and represents a suitably defined average OTU community.

Calculate eigengenes:

MEList = moduleEigengenes(datExpr0, colors = dynamicColors)
MEs = MEList$eigengenes

Calculate dissimilarity of module eigengenes:

MEDiss = 1-cor(MEs);

Cluster module eigengenes:

METree = hclust(as.dist(MEDiss), method = "average");

Plot the result:

sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")

Now we will see if any of the modules should be merged. I chose a height cut of 0.30, corresponding to a similarity of 0.70 to merge:

MEDissThres = 0.30

Plot the cut line into the dendrogram:

abline(h=MEDissThres, col = "red")

You can see that, according to our cutoff, none of the modules should be merged.

If there were some modules that needed to be merged you can call an automatic merging function:

merge = mergeCloseModules(datExpr0, dynamicColors, cutHeight = MEDissThres, verbose = 3)

The merged module colors:

mergedColors = merge$colors;

Eigengenes of the new merged modules:

mergedMEs = merge$newMEs;

If you had combined different modules then that would show in this plot:

sizeGrWindow(12, 9)

plotDendroAndColors(TaxaTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)

If we had merged some of the modules that would show up in the Merged dynamic color scheme.

Rename the mergedColors to moduleColors:

moduleColors = mergedColors

Construct numerical labels corresponding to the colors:

colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)-1;
MEs = mergedMEs;

Save module colors and labels for use in subsequent parts:

save(MEs, moduleLabels, moduleColors, TaxaTree, file = "Monterey-networkConstruction-stepByStep.RData")

Code–Relating modules to external information and IDing important taxa:

Here you are going to identify modules that are significantly associated with environmental traits/biogeochemical rates. You already have summary profiles for each module (eigengenes–remember that an eigengene is the 1st principal component of a module expression matrix and represents a suitably defined average OTU community), so we just have to correlate these eigengenes with environmental traits and look for significant associations.

Defining numbers of OTUs and samples:

nTaxa = ncol(datExpr0);
nSamples = nrow(datExpr0);

Recalculate MEs (module eigengenes):

MEs0 = moduleEigengenes(datExpr0, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

Now we will visualize it:

sizeGrWindow(10,6)

textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));

labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))

Each row corresponds to a module eigengene and each column corresponds to an environmental trait or biogeochemical rate (as long as it is continuous–notice that the categorical variables are gray and say NA). Each cell contains the corresponding Pearson correlation coefficient (top number) and a p-value (in parentheses). The table is color-coded by correlation according to the color legend.

You can see that the Brown module is positively correlated with many indices of upwelling while the Black module is negatively correlated with many indices of upwelling. For this work I was particularly interested in CR, and so I focused on modules that positively or negatively correlated with CR. The Red module was negatively associated with CR while the Blue module was positively associated with CR.

Let’s look more at the Red module by quantifying the associations of individual taxa with CR:

First define the variable we are interested in from datTrait:

CR = as.data.frame(datTraits$CR);
names(CR) = "CR"

modNames = substring(names(MEs), 3)
TaxaModuleMembership = as.data.frame(cor(datExpr0, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaModuleMembership), nSamples));
names(TaxaModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
TaxaTraitSignificance = as.data.frame(cor(datExpr0, CR, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaTraitSignificance), nSamples));
names(TaxaTraitSignificance) = paste("GS.", names(CR), sep="");
names(GSPvalue) = paste("p.GS.", names(CR), sep="");

module = "red"
column = match(module, modNames);
moduleTaxa = moduleColors==module;
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(TaxaModuleMembership[moduleTaxa, column]),
abs(TaxaTraitSignificance[moduleTaxa, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Taxa significance for CR",
main = paste("Module membership vs. Taxa significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

This graph shows, for each taxon (each red dot is an OTU that belongs to the Red module), 1) how it correlates with the environmental trait of interest and 2) how important it is to the module. The taxa/OTUs that have high module membership tend to occur whenever the module is represented in the environment and are therefore often connected throughout the samples with other red taxa/OTUs. In this module, these hubs (Red OTUs that occur with other Red OTUs) are also the most important OTUs for predicting CR.

Now let’s get more info about the taxa that make up the Red module:

First, merge the statistical info from the previous section (modules with high association with the trait of interest–e.g. CR or Temp) with taxa annotation and write a file that summarizes these results:

names(datExpr0)
names(datExpr0)[moduleColors=="red"]

You will have to feed in an annotation file–a file listing which Bacteria/Archaea go with each OTU (I am not providing you with this file, but it just has a column with OTUs and a column with the Taxonomy).

annot = read.table("MB.subsample.fn.0.03.cons.taxonomy",header=T,sep="\t");
dim(annot)
names(annot)
probes = names(datExpr0)
probes2annot = match(probes, annot$OTU)

Check the number of probes without annotation (it should return 0):

sum(is.na(probes2annot))

Create the starting data frame:

TaxaInfo0 = data.frame(Taxon = probes,
TaxaSymbol = annot$OTU[probes2annot],
LinkID = annot$Taxonomy[probes2annot],
moduleColor = moduleColors,
TaxaTraitSignificance,
GSPvalue)

Order modules by their significance for the trait of interest (here, CR):

modOrder = order(-abs(cor(MEs, CR, use = "p")));

Add module membership information in the chosen order:

for (mod in 1:ncol(TaxaModuleMembership))
{
oldNames = names(TaxaInfo0)
TaxaInfo0 = data.frame(TaxaInfo0, TaxaModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(TaxaInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}

Order the OTUs in the TaxaInfo0 variable first by module color, then by taxa-trait significance (GS.CR):

TaxaOrder = order(TaxaInfo0$moduleColor, -abs(TaxaInfo0$GS.CR));
TaxaInfo = TaxaInfo0[TaxaOrder, ]

Write file:

write.csv(TaxaInfo, file = "TaxaInfo.csv")

Here is a bit of the output file I got:

Taxon TaxaSymbol LinkID moduleColor GS.TotalRate p.GS.TotalRate MM.red p.MM.red MM.blue p.MM.blue MM.green p.MM.green MM.brown p.MM.brown MM.turquoise p.MM.turquoise MM.black p.MM.black MM.yellow p.MM.yellow
Otu00711 Otu00711 Otu00711 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);Surface_4(100); black 0.461005 0.00111 0.005028 0.973244 0.243888 0.098526 -0.07719 0.606075 -0.25274 0.086535 0.058878 0.694233 0.502027 0.000324 0.132412 0.374947
Otu00091 Otu00091 Otu00091 Bacteria(100);Bacteroidetes(100);Flavobacteriia(100);Flavobacteriales(100);Flavobacteriaceae(100);Formosa(100); black 0.378126 0.008778 -0.17243 0.246462 0.446049 0.001676 0.34467 0.017667 -0.55057 6.08E-05 0.492517 0.000437 0.615168 4.20E-06 0.367211 0.011115
Otu00082 Otu00082 Otu00082 Bacteria(100);Bacteroidetes(100);Flavobacteriia(100);Flavobacteriales(100);Flavobacteriaceae(100);NS5_marine_group(100); black -0.35649 0.013911 0.222515 0.132755 -0.06428 0.667734 0.175654 0.237601 -0.45502 0.001312 0.421756 0.003151 0.750195 1.28E-09 0.126362 0.397349
Otu00341 Otu00341 Otu00341 Bacteria(100);Bacteroidetes(100);Cytophagia(100);Cytophagales(100);Flammeovirgaceae(100);Marinoscillum(100); black -0.28242 0.054435 0.023927 0.873162 -0.07555 0.613762 0.144688 0.331879 -0.03144 0.833838 0.035147 0.814565 0.209255 0.158058 -0.06565 0.661083
Otu00537 Otu00537 Otu00537 Bacteria(100);Verrucomicrobia(100);Verrucomicrobiae(100);Verrucomicrobiales(100);Verrucomicrobiaceae(100);Persicirhabdus(100); black -0.23668 0.109211 0.123673 0.40755 -0.17362 0.243171 -0.05738 0.701628 -0.26399 0.072961 0.264411 0.072493 0.425082 0.002897 0.040045 0.789278
Otu00262 Otu00262 Otu00262 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);Surface_1(100);Candidatus_Pelagibacter(90); black -0.23615 0.110023 0.327396 0.02468 -0.22748 0.12411 -0.13779 0.355699 -0.23708 0.108594 0.271968 0.064409 0.554592 5.23E-05 0.036113 0.809563
Otu00293 Otu00293 Otu00293 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);SAR11_clade_unclassified(100); black 0.223427 0.131133 0.142016 0.34098 0.209327 0.157912 0.234713 0.112274 -0.53032 0.000126 0.529907 0.000128 0.627937 2.30E-06 0.390714 0.006621
Otu00524 Otu00524 Otu00524 Bacteria(100);Actinobacteria(100);Acidimicrobiia(100);Acidimicrobiales(100);OM1_clade(100);Candidatus_Actinomarina(100); black -0.20559 0.165629 0.28312 0.053809 0.016758 0.910982 0.148756 0.318316 -0.34758 0.016669 0.376043 0.009188 0.494903 0.000406 0.377597 0.00888
Otu00370 Otu00370 Otu00370 Bacteria(100);Verrucomicrobia(100);Opitutae(100);MB11C04_marine_group(100);MB11C04_marine_group_unclassified(100); black -0.20397 0.169074 0.303984 0.037771 -0.06655 0.656707 0.009401 0.949991 -0.25451 0.084275 0.097595 0.514 0.642409 1.13E-06 0.224711 0.128875


NOTES on output:

moduleColor is the module that the OTU was ultimately put into

GS stands for Gene Significance (for us it means taxon significance) while MM stands for module membership.

GS.Environmentaltrait = Pearson correlation coefficient for that OTU with the trait. GS allows incorporation of external info into the co-expression network by showing gene/OTU significance. The higher the absolute value of GS, the more biologically significant the gene (or in our case, the taxon) is to that external variable (e.g. CR).
p.GS.Environmentaltrait = P value for the preceding relationship.

MM.color = Pearson correlation coefficient for Module Membership–i.e. how well that OTU correlates with that particular color module (each OTU has a value for each module but only belongs to one module). If this value is close to 0 or negative then the taxon is not part of that color module (since each OTU has to be put in a module you may get some OTUs that are close to 0, but they aren’t important to that module). If it is close to 1 then it is highly connected to that color module, but it will be placed in the color module that it is most connected to throughout the samples.
p.MM.color = P value for the preceding relationship.

Modules will be ordered by their significance for the external variable you selected (e.g. CR), with the most significant ones to the left.
Each of the modules (with each OTU assigned to exactly one module) will be represented for the environmental trait you selected.
You will have to rerun this for each environmental trait you are interested in.
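
If you have several traits of interest, one option is to wrap the trait-specific steps above in a loop and write one output file per trait. Here is a minimal sketch under the assumption that the datTraits columns you care about are numeric (the column names below are placeholders; substitute your own):

for (trait in c("CR", "Temperature"))
{
tr = as.data.frame(datTraits[[trait]]);
names(tr) = trait
TaxaTraitSignificance = as.data.frame(cor(datExpr0, tr, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaTraitSignificance), nSamples));
names(TaxaTraitSignificance) = paste("GS.", trait, sep="");
names(GSPvalue) = paste("p.GS.", trait, sep="");
# Rebuild the summary data frame for this trait (append the MM columns as above if you want them):
TaxaInfo0 = data.frame(Taxon = probes,
LinkID = annot$Taxonomy[probes2annot],
moduleColor = moduleColors,
TaxaTraitSignificance,
GSPvalue)
write.csv(TaxaInfo0, file = paste("TaxaInfo_", trait, ".csv", sep=""))
}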


Posted in Computer tutorials | 3 Comments

NASA Postdoctoral Program opportunity

A new NASA Postdoctoral Program (NPP) opportunity was posted today for our Oceans Across Space and Time (OAST) initiative.  What’s OAST?  I can’t tell you yet, because the relevant press releases have been stuck in purgatory for several weeks.  Hopefully I can write that post next week.  Nonetheless you can figure out what we’re all about based on the NPP opportunity here, and pasted below.  If you’re interested, act fast: NPP proposals are due November 1!

Past and present oceans are widely distributed in our solar system and are among the best environments to search for extant life. New tools, techniques, and strategies are required to support future life detection missions to ocean worlds. Oceans Across Space and Time (OAST) is a new project under the Network for Life Detection (NfoLD) research coordination network. OAST seeks to understand the distribution, diversity, and limits of life on contemporary, remnant, and relict ocean worlds to facilitate the search for extant life. Central to this effort is the development of an Ocean Habitability Index that characterizes life in the context of the physicochemical stressors and metabolic opportunities of an ocean environment. OAST will be developing the OHI based on field efforts in surficial hypersaline environments and deep hypersaline anoxic basins, and from laboratory-based experimental studies.

Postdoctoral fellows are sought to support OAST activities across three themes: characterizing ocean habitability, detecting and measuring biological activity, and understanding diversity and the limits of evolution. Postdoctoral fellows are expected to interact across two or more institutions within OAST. Participating institutions are Georgia Institute of Technology (Schmidt, Glass, Ingall, Reinhardt, Stewart, Stockton), Scripps Institution of Oceanography at UC San Diego (Bowman, Bartlett), Massachusetts Institute of Technology (Pontrefact and Carr), Louisiana State University (Doran), University of Kansas (Olcott), Stanford University (Dekas), Blue Marble Space Institute of Science (Som), and the University of Texas at Austin (Soderlund). Candidates should contact the PI (Schmidt) and two potential mentors early in the proposal process to scope possible projects in line with the main directions of OAST and NfoLD.

Posted in Uncategorized | Leave a comment

Postdoctoral researcher sought in bioinformatics/predictive analytics

We have an opening for a postdoctoral scholar in the area of bioinformatics and predictive analytics.  The ideal candidate will have demonstrated expertise in one of these two areas and a strong interest in the other.  Possible areas of expertise within predictive analytics include the development and application of neural networks and other machine learning techniques, or nonlinear statistical modeling techniques such as empirical dynamical modeling.  Possible areas of expertise within bioinformatics include genome assembly and annotation, metagenomic or metatranscriptomic analysis, or advanced community structure analysis.  Candidates should be proficient with R, Python, or Matlab, familiar with scientific computing in Linux, and fluent in English.

The successful candidate will take the lead in an exciting new industry collaboration to predict process thresholds from changes in microbial community structure.  The goal of this collaboration is not only to predict thresholds, but to fully understand the underlying ecosystem dynamics through genomics and metabolic modeling.  The candidate will be in residence at Scripps Institution of Oceanography at UC San Diego, but will have the opportunity to work directly with industry scientists in the San Diego area.  The candidate is expected to take on a leadership role in the Bowman Lab, participate in training graduate students and undergraduates, represent the lab at national and international meetings, and publish their work in scholarly journals.  The initial opportunity is for two years with an option for a third year contingent on progress made.  The position starts January of 2019.

Interested candidates should send a CV and a brief, 1-page statement of interest to Jeff Bowman at jsbowman at ucsd.edu.  The position will remain open until filled, but interested candidates are encouraged to apply by 11/16 for first consideration.

Posted in Job opportunities | Leave a comment

Separating bacterial and archaeal reads for analysis with paprica

One of the most popular primer sets for 16S rRNA gene amplicon analysis right now is the 515F/806R set. One of the advantages of this pair is that it amplifies broadly across the domains Archaea and Bacteria. This reduces by half the amount of work required to characterize prokaryotic community structure, and allows a comparison of the relative (or absolute, if you have counts) abundance of bacteria and archaea.  However, paprica and many other analysis tools aren’t designed to simultaneously analyze reads from both domains.  Different reference alignments or covariance models, for example, might be required.  Thus it’s useful to split an input fasta file into separate bacterial and archaeal files.

We like to use the Infernal tool cmscan for this purpose.  First, you’ll need to acquire covariance models for the 16S/18S rRNA genes from all three domains of life.  You can find those on the Rfam website; they are also included in paprica/models if you’ve downloaded paprica.  Copy the models to a new subdirectory in your working directory, combining them into a single file:

mkdir cms
cat ../paprica/models/*cm > cms/all_domains.cm
cd cms

Now you need to compress and index the covariance models using the cmpress utility provided by Infernal.  This takes a while.

cmpress all_domains.cm
cd ..   # return to the working directory that contains your fasta files

Pretty simple.  Now you’re ready to do some work.  The whole Infernal suite of tools has pretty awesome built-in parallelization, but with only three covariance models in the file you won’t get much out of it.  Best to minimize cmscan’s use of cores and instead push lots of files through it at once.  This is easily done with GNU Parallel:

ls *.fasta | parallel -u cmscan --cpu 1 --tblout {}.txt cms/all_domains.cm {} > /dev/null

Next comes the secret sauce.  The command above produces an easy-to-parse, easy-to-read table with classification stats for each of the covariance models that we searched against.  Paprica contains a utility, paprica/utilities/pick_16S_domain.py, to parse the table, figure out which model scored best for each read, and then write a new fasta file for each of the domains Bacteria, Archaea, and Eukarya (the primers will typically pick up a few euks).  We’ll parallelize the script just as we did for cmscan.

ls *.fasta | parallel -u python pick_16S_domain.py -prefix {} -out {}

Now you have domain-specific files that you can analyze in paprica or your amplicon analysis tool of choice!


Posted in paprica | 2 Comments

We’re joining MOSAiC!

The predicted drift track of Polarstern during MOSAiC, taken from the MOSAiC website.

This has been a (rare) good week for funding decisions.  A loooooong time ago when I was a third year PhD student I wrote a blog post that mentioned the Multidisciplinary drifting Observatory for the Study of Arctic Climate (MOSAiC) initiative.  The Arctic has changed a lot over the last couple of decades, as evidenced by the shift from perennial (year-round) to seasonal sea ice cover.  This is a problem for climate modelers whose parameterizations and assumptions are based on observational records largely developed in the “old” Arctic.  MOSAiC was conceived as a coupled ocean-ice-atmosphere study to understand key climatic and ecological processes as they function in the “new” Arctic.  The basic idea is to drive the German icebreaker Polarstern into the Laptev Sea in the fall of 2019, tether it to an ice floe, and allow it to follow the transpolar drift (Fram style) for a complete year.  This will provide continuous time-series observations across the complete seasonal cycle, and an opportunity to carry out a number of key experiments.

I first attended a MOSAiC workshop in 2012 (when I wrote that first blog post).  It only took six years and two tries, but we’ve officially received NSF support to join the MOSAiC expedition.  Co-PI Brice Loose at URI and I will be carrying out a series of experiments to better understand how microbial community structure and ecophysiology control fluxes of O2 (as a proxy for carbon) and methane in the central Arctic Ocean.  The central Arctic Ocean is a weird place and we know very little about how these processes play out there. Like the subtropical gyres it’s extremely low in nutrients, but the low temperatures, extreme seasonality, (seasonal) sea ice cover, and continental shelf influences make it very different from the lower-latitude oceans.

The German icebreaker Polarstern, which will host the MOSAiC field effort.

Posted in MOSAiC, Uncategorized | 1 Comment

CURE-ing microbes on ocean plastic

I was excited to learn today that our proposal CURE-ing microbes on ocean plastic, led by collaborators Ana Barral and Rachel Simmons at National University, was just funded by the National Science Foundation through the Hispanic Serving Institutions (HSI) program.  The Bowman Lab will play a small but important role in this project and it should be quite a bit of fun.  As a private, non-profit college, National University is a somewhat unusual organization that serves a very different demographic than UC San Diego.  A large number of the students at National University are non-traditional, meaning that they aren’t going straight to college from high school.  A significant number are veterans, and National University is classified as an HSI.  So working with National University is a great opportunity to interact with a very different student body.

A microbial colonization pilot experiment attached to our mounting bracket on the SIO Pier in May.  If you look closely you can see the biofilm developing on the plastic wafers inside the cage.  The composition of the biofilm and its potential to degrade plastics will be explored by students at National University.

For this project Ana and Rachel developed a course-based undergraduate research experience (CURE) around the topic of ocean plastics.  This is an issue that is getting quite a bit of attention in the popular press, but has largely fallen through the cracks as far as research goes, possibly because it’s too applied for NSF and too basic (and expensive) for NOAA.  A lot of somewhat dubious work is going on in the fringe around the theme of cleaning up marine plastics, but we lack a basic understanding of the processes controlling the decomposition of plastics and their entry into marine foodwebs (topics that do less to stoke the donor fires than cleanup does).  This CURE is a way to get students (and us) thinking about the science of marine plastics, specifically the colonization and degradation of plastics by marine microbes, while learning basic microbiology, molecular biology, and bioinformatic techniques.  The basic idea is to have the students deploy plastic colonization experiments in the coastal marine environment, isolate potential plastic degraders, sequence genomes, and carry out microbial community structure analysis.  Different courses will target different stages in the project, and students taking the right sequence of courses could be hands-on from start to finish.

Our role in the project is to provide the marine microbiology expertise.  Lab members will have an opportunity to give lectures and provide mentoring to National University students, and we’ll handle the deployment and recovery of plastic-colonization experiments on the SIO Pier (yay, more diving!).  We’ll also play a role in analyzing and publishing the data from these experiments.  Many thanks to lab alumnus Emelia DeForce (formerly with MoBio and now with Fisher Scientific) for bringing us all together on this project!


Ana Barral’s (center) B100A: Survey of Biosciences Lab at National University after recovery of the microbial colonization pilot experiment in June.  Thanks Allison Lee (at left) for coming in to dive on a Saturday morning!

Posted in Uncategorized | 1 Comment

A practical lesson in marine metals

This week Alyssa Demko from the Jenkins Lab and I dove to make repairs on our sample pump intake on the SIO pier.  Very soon the pump will supply water to our membrane inlet mass spectrometer and biological sampling manifold, so we’re eager to keep things in good working condition.  Our pump intake is secured to the pier by a very heavy stainless steel bracket.  When we first installed the bracket we opted for silicon bronze hardware; silicon bronze is pricey but among the most corrosion-resistant alloys available.  When we last dove I noticed the hardware was corroding very rapidly, to the point that a good storm would have ripped the whole contraption off the pier.  Fortunately it’s summer!  Here’s some of the hardware that we recovered from our dive:


When installed these were 5 x 0.5 inch bolts with 3/4 inch heads.  This is some serious corrosion for a 4 month deployment!  Silicon bronze is supposed to be corrosion resistant, so what happened?

The problem is that when two or more metals are in contact in the presence of an electrolyte (like seawater) they interact.  Specifically, some metals like to donate electrons to other metals.  The metal doing the donating (called the anode) corrodes more quickly than the metal that receives them (called the cathode).  Because this transfer replaces electrons that the cathode loses to seawater, the presence of the anode actually slows the corrosion of the cathode.  This is a well known process that we planned for when we designed the system, and we included a zinc plate called a sacrificial anode that serves no purpose other than to donate electrons to the stainless steel bracket.  Enter silicon bronze.

How readily one metal donates electrons to another metal is indicated by their locations on the galvanic series.  The further apart two metals are, the more readily electrons flow from the anodic to the cathodic metal.  Silicon bronze is more anodic than stainless steel, particularly the 316 alloy we are using, but I figured it was close enough.  Apparently not, however, and we didn’t account for other factors that can influence the rate of electron transfer.  An important one is the surface area of the cathode relative to the anode.  Remember that the cathode is losing electrons to seawater; this is what drives the flow of electrons from the anode to the cathode in the first place (if you put them in contact in, say, air, mineral oil, or some other non-conductive medium nothing will happen).  So the more of the cathode’s surface is in contact with seawater, the more electrons will flow from the cathode to seawater, and from the anode to the cathode.  In our system the relatively small bronze bolts were attached to a very large stainless steel bracket, and I think this accounts for the rapid corrosion.

There is one thing that I still don’t understand, which is why the zinc anode in the system didn’t protect the bronze and stainless steel.  Bronze will sacrifice itself for stainless steel (as we’ve clearly demonstrated), but zinc should sacrifice to bronze and stainless steel.  However, our sacrificial zinc anode looks almost as good as it did when I installed it.  In the meantime here’s a video of the impressive lobster party taking place on the piling where our sampling system is located (don’t even think about it, it’s a no take zone!). At the end of the video you can see our shiny new 316 stainless steel bolts. Hopefully these last!

Posted in Uncategorized | 2 Comments

Simons ECIMMEE award

The Simons Foundation Early Career Investigator in Marine Microbial Ecology and Evolution (ECIMMEE) awards were announced today and I’m thrilled that the Bowman Lab was selected.  Our project is centered on using the SIO Pier as a unique platform for collecting ecological data at a high temporal resolution.  Consider that marine heterotrophic bacteria often have cell division times on the order of hours – even less under optimal conditions – and that entire populations can be decimated by grazing or viral attack on similarly short timescales.  A typical long term study might sample weekly; the resulting time-series is blind to the dynamics that occur on shorter time-scales.  This makes it challenging to model key ecological processes, such as the biogeochemical consequences of certain microbial taxa being active in the system.

Over the past year we’ve been slowly developing the capability to conduct high(er)-resolution time-series sampling at the SIO Pier.  This award will allow us to take these efforts to the next level and really have some fun, from an ecological observation and modeling point of view.  Our goal is to develop a sampling system that is agnostic to time, but instead observes microbial community structure and other parameters along key ecological gradients like temperature and salinity.  Following the methods in our 2017 ISME paper, we can model key processes like bacterial production from community structure and physiology data, allowing us to predict those processes for stochastic events that would be impossible to sample in person.

The SIO Pier provides a unique opportunity to sample ocean water within minutes of the lab. 

Sometimes it’s useful to visualize all that goes into scientific observations as a pyramid.  At the tip of this pyramid is a nice model describing our ecological processes of interest.  Way down at the base is a whole lot of head-scratching, knuckle-scraping labor to figure out how to make the initial observations to inform that model.  One of our key challenges – which seems so simple – is just getting water up to the pier deck where we can sample it in an automated fashion.  The pier deck is about 30 feet above the water, and most pumps designed to raise water to that height deliver much too much for our purposes.  We identified a pneumatic pumping system that does the job nicely, but the pump intake requires fairly intensive maintenance and a lot of effort to keep the biology off it.  Here’s a short video of me attempting to clean and reposition the (kelp covered) pump intake on Monday, shot by Gabriel Castro, a graduate student in the Marine Natural Products program at SIO (thanks Gabriel for the assist!).  Note the intense phytoplankton bloom and moderate swell, not an easy day!

Posted in Uncategorized | 4 Comments

New seascape analysis of the western Antarctic Peninsula

We have a paper “Recurrent seascape units identify key ecological processes along the western Antarctic Peninsula” that is now available via advance-online through the journal Global Change Biology.  I place full blame for this paper on my former postdoctoral advisor and co-author Hugh Ducklow.  Shortly after I arrived for my postdoc at the Lamont-Doherty Earth Observatory, Hugh suggested using all the core Palmer LTER parameters since the start of that program, and “some kind of multivariate method” to identify different years.  The presumption was that different years would map to some kind of recognizable ecological or climatic phenomenon.

At the time I knew nothing about seascapes or geospatial analysis.  However, I had been playing around with self organizing maps (SOMs) to segment microbial community structure data.  I thought that similarly segmenting geospatial data would yield an interesting result, so we gave it a go.  This involved carefully QC’ing all the core Palmer LTER data since 1993 (sadly discarding several years with erroneous or missing parameters), interpolating the data for each year to build 3 dimensional maps of each parameter (you can find these data here), then classifying each point in these maps with a SOM trained on the original data.  After a lot of back and forth with co-authors Maria Kavanaugh and Scott Doney, we elected to use the term “seascape unit” for different regions of the SOM.  Our classification scheme ultimately maps these seascape units to the original sampling grid.  By quantifying the extent of each seascape unit in each year we can attempt to identify similar years, and also identify climatic phenomena that exert controls on seascape unit abundance.
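
For readers curious what the segmentation step looks like mechanically, here is a minimal sketch of the general idea using the kohonen package in R.  It is illustrative only: the object names, grid size, and training parameters are placeholders rather than our actual pipeline.

library(kohonen)

# 'training' is a hypothetical matrix of observations (rows) x parameters (T, S, nutrients, chl, ...)
training.scaled <- scale(training)
som.model <- som(training.scaled, grid = somgrid(xdim = 4, ydim = 2, topo = "hexagonal"), rlen = 1000)

# 'interpolated' is a hypothetical matrix of interpolated grid points with the same columns;
# scale it with the training means and standard deviations, then assign each point to a map unit
interpolated.scaled <- scale(interpolated,
center = attr(training.scaled, "scaled:center"),
scale = attr(training.scaled, "scaled:scale"))
seascape.unit <- map(som.model, interpolated.scaled)$unit.classif

Each node of the trained SOM then corresponds to a candidate seascape unit, and every interpolated grid point inherits the unit of its best-matching node.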

If you’re scratching your head at why it’s necessary to resort to seascape units for such an analysis it’s helpful to take a look at the training data in the standard T-S plot space.

Fig. 2 from Bowman et al., 2018.  Distribution of the training data in a) silicate-nitrate space and b) T-S space.  The color of the points gives the seascape unit.

The “V” distribution of points is characteristic of the western Antarctic Peninsula (WAP), and highlights the strong, dual relationship between temperature and salinity.  The warmest, saltiest water is associated with upper circumpolar deep water (UCDW) and originates from offshore in the Antarctic Circumpolar Current (ACC).  The coldest water is winter water (WW), which is formed from brine rejection and heat loss during the winter.  Warm, fresh water is associated with summer surface water (SW).  Note, however, that multiple seascape units are associated with each water mass.  The reason for this is that nutrient concentrations can vary quite a bit within each water mass.  My favorite example is WW, which we usually think of as rich in nutrients (nutrient replete); it is, but WW associated with SU 2 is a lot less nutrient rich than that associated with SU 3.  Both will support a bloom, but the strength, duration, and composition of the bloom is likely to differ.

To evaluate how different climatic phenomena might influence the distribution of seascape units in different years we applied elastic-net regression as described here.  This is where things got a bit frustrating.  It was really difficult to build models that described a reasonable amount of the variance in seascape unit abundance.  Where we could, the usual suspects popped out as good predictors; October and January ice conditions play a major role in determining the ecological state of the WAP.  But it’s clear that lots of other things do as well, or that the tested predictors are interacting in non-linear ways that make it very difficult to predict the occurrence of a given ecosystem state.
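
For anyone unfamiliar with elastic-net regression, here is a minimal sketch of the type of model we fit, using the glmnet package; the object names are placeholders (in practice the predictors were time-lagged climate and sea ice variables, and the response was the abundance of a seascape unit):

library(glmnet)

# 'predictors' is a hypothetical years x variables matrix (ice extent, ENSO, SAM, etc.),
# 'response' a hypothetical vector of seascape unit relative abundance in each year
cv.fit <- cv.glmnet(x = as.matrix(predictors), y = response, alpha = 0.5, nfolds = 5)

# coefficients at the lambda minimizing cross-validated error;
# non-zero coefficients indicate the retained predictors
coef(cv.fit, s = "lambda.min")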

We did get some interesting results predicting clusters of years.  Based on hierarchical clustering, the relative abundance of seascape units in a core sampling area suggests two very distinct types of years.  We tested models based on combinations of time-lagged variables (monthly sea ice extent, fraction of open water, ENSO, SAM, etc.) to predict year-type, with June and October within-pack ice open water extent best predicting year-type.  This fits well with our understanding of the system; fall and spring storm conditions are known to exert a control on bloom conditions the following year.  In our analysis, when the areal extent of fall and spring within-pack ice open water is high (think cold but windy), chlorophyll the following summer is associated with a specific seascape (SU 1 below) that is found most frequently offshore.  When the opposite conditions prevail, chlorophyll the following summer is associated with a specific seascape (SU 8) that is found most frequently inshore.  Interestingly, the chlorophyll inventory isn’t that different between year-types, but the density and distribution of chlorophyll is, which presumably matters for the higher trophic levels (that analysis is somewhere on my to-do list).

Fig. 3 from Bowman et al., 2018.  Clustering of the available years into different year-types.  The extent of within-pack ice open water in June and October are reasonable predictors of year-type.  Panel A shows the relative abundance of seascape unit for each year-type.  Panel B shows the fraction of chlorophyll for each year that is associated with each seascape unit.

One of our most intriguing findings was a steady increase in the relative abundance of SU 3 over time.  SU 3 is one of those seascapes associated with winter water; it’s the low nutrient winter water variant.  That steady increase means that there has been a steady decrease in the bulk silicate inventory in the study area with time.  I’m not sure what that means, though my guess is there’s been an increase in early season primary production which could draw down nutrients while winter water is still forming.

Fig. 5 from Bowman et al., 2018.  SU 3, a low-nutrient flavor of winter water, has been increasing in relative abundance.  This has driven a significant decrease in silicate concentrations in the study area during the Palmer LTER cruise.

Posted in Uncategorized | Leave a comment