MOSAiC is go!

From the tropics to the Arctic… I spent last week in Tromsø, Norway helping prepare the German icebreaker Polarstern for the MOSAiC year-long polar drift expedition. As I’ve written in past posts, I’ve been waiting for this moment since 2012 and it’s hard to believe it’s finally here. MOSAiC is a true coupled ocean-ice-atmosphere study, and the first such study of its scope or scale. There have been modern overwintering expeditions in the Arctic before – most notably the SHEBA expedition of the late 1990s – but none approached the breadth or scale of MOSAiC.

The start of the MOSAiC expedition in Tromsø, Norway.

The basic idea behind MOSAiC is to drive Polarstern into the Laptev Sea and tether the ship to an (increasingly rare) large floe of multiyear sea ice. As we move toward winter, the floe and Polarstern will become encased in newly forming sea ice. The ship will drift with this ice through the full cycle of seasons, allowing a rare opportunity to study the physical, chemical, and biological characteristics of sea ice through its full progression of growth and decay.

The German icebreaker Polarstern tethered to an ice floe in the Arctic. Image from https://www.mosaic-expedition.org/expedition/drift/.

But MOSAiC is about more than sea ice. Sea ice is – for now – a dominant ecological feature of the central Arctic, and it exerts a strong influence on both the atmosphere and the upper ocean. Better predicting the consequences of reduced sea ice cover on these environments is a major goal of the expedition.

With support from the National Science Foundation, PhD student Emelia Chamberlain and I are tackling our own little piece of MOSAiC in collaboration with Brice Loose and postdoctoral researcher Alessandra D’Angelo at the University of Rhode Island, along with colleagues from the Alfred Wegener Institute in Germany. We’ll be looking at how the structure of prokaryotic and eukaryotic communities in sea ice and the upper ocean influences the oxidation of methane (a potent greenhouse gas) and the production and uptake of CO2. I’m looking forward to joining Polarstern in late January for a long, cold stint at the end of the polar night!

Our lab on Polarstern.
We searched in Tromsø for a totem for the lab, but ran a bit short on time and settled for Igor. Trolls are troublesome creatures and not, I think, particularly emblematic of our project team. Cavity ring-down spectrometers and mass specs, however, can be a bit trollish at times. So the totem is for them. Igor will be in charge of our little group of instruments. We can direct our frustrations at him, and hopefully by placating him with offerings we can keep things running smoothly.
The Akademik Fedorov, a Russian research icebreaker that will sail with Polarstern and help establish the drifting observatory. Fedorov will return in a few weeks.
Dancing on ice floes. The MOSAiC launch was quite an event with lectures, a party, and a hi-tech light show. The show included an interactive ice floe field – step on the floes and they crack to become open water, slowly freezing after you pass. It was well done.
It’s the Polarstern projected on the Polarstern. So meta.
And they’re off! Waving goodbye to the Polarstern.
Posted in MOSAiC | Leave a comment

Perspectives on Ocean Science Lecture

A couple of months ago I was fortunate to have the opportunity to give a lecture at the Birch Aquarium at Scripps in the Perspectives on Ocean Science lecture series. The lecture covered some emerging topics in Arctic Oceanography and provided a brief intro to the upcoming MOSAiC expedition. The lecture was broadcast by UCTV and can be found here. Matthias Wietz – sorry for botching your introduction on the title slide! (Matthias was a PhD student at the Technical University of Denmark when the picture was taken. The record has been set straight.)

Posted in MOSAiC | Leave a comment

OAST visits the South Bay Saltworks

Last week we were busy hosting the inaugural Oceans Across Space and Time (OAST, @Space Oceans OAST on Facebook) combined first-year meeting and field effort. It was a crazy week but a huge success. The goal of OAST is to improve life detection efforts on future NASA planetary science missions by better understanding how biomass and activity are distributed in habitats that mimic past or present “ocean worlds”. Ocean worlds is a concept that has gained a lot of traction in the last few years (see our Roadmap to Ocean Worlds synthesis paper here). Our solar system holds many past or present ocean worlds (Earth obviously, but also Mars, Europa, Enceladus, and a whole host of other ice-covered moons), and planetary bodies with oceans are considered more likely to host life. Our first-year effort focused on some open-ocean training for the Icefin robot, designed for exploring the protected spaces below floating ice shelves, and a multi-pronged investigation of the South Bay Salt Works.

The South Bay Salt Works in Chula Vista, CA. A truly amazing site for exploring how microbial activity and biomass are distributed across environmental gradients.

The Salt Works are an amazing environment that my lab has visited previously (see here and here). Our previous work in this environment has raised more questions than answers, so it was great to hit a few of our favorite spots with a top-notch team of limnologists, microbiologists, geochemists, and engineers.

Part of the OAST team setting up next to some very high salinity NaCl-dominated lakes. The pink color of the lakes is the true color, and is common to high salinity lakes. The color comes from carotenoid pigments in the halophilic archaea that dominate these lakes.
This is what I love about NASA – it’s an agency that develops the most sophisticated technology in the history of human civilization, but isn’t afraid to use a rock when the situation calls for it. Spanning several millennia of technological advancement is Maddie Myers (LSU), with Natalia Erazo (SIO) and Carly Novak (Georgia Tech) in the background.
Carly Novak (Georgia Tech) sampling salts with Peter Doran (LSU) and his “surfboard of science” in the background.
Doug Bartlett (SIO), a little out of his element at only 1 atm.

Posted in OAST, Research | Leave a comment

Training for MOSAiC: Bremerhaven & Utqiagvik

A photo of me with the famous Utqiagvik whale-bone arch, and behind, the Chukchi Sea.

Hello! My name is Emelia Chamberlain and I am a first-year PhD student here in the Bowman Lab working on the MOSAiC project. I just got back from a very exciting week in Utqiagvik, Alaska for MOSAiC snow and ice training. But first, an overview… As mentioned in an earlier post, the Multidisciplinary drifting Observatory for the Study of Arctic Climate (MOSAiC) project is an international effort to study the Arctic ocean-ice-atmosphere system, with the goal of clarifying key climatic and ecological processes as they function in a changing Arctic. Within the larger scope of this project, our lab and collaborators from the University of Rhode Island (URI) will be studying how microbial community structure and ecophysiology control fluxes of oxygen and methane in the central Arctic Ocean.

MOSAiC begins in September 2019, when the German icebreaker RV Polarstern will sail into the Laptev Sea and be tethered to an ice floe. Once trapped in the ice, both ship and scientists will spend the next year drifting through the Arctic. The goal is to set up a central observatory and collect time-series observations across the complete seasonal cycle. This year-long time series will be both exciting and critical for the future of Arctic research, but it is logistically difficult to carry out. The cruise is split into six “legs”, with scientists taking two-month shifts collecting observations and living the Arctic life. Resupply will be carried out by other icebreakers and aircraft. I myself will be taking part in the last two legs of this project from June to October 2020, with Jeff, co-PI Brice Loose (URI), and his post-doc Alessandra D’Angelo (URI) representing our project on the rest of the voyage.

A representation of the central observatory taken from the MOSAiC website

Laboratory training in Bremerhaven, Germany

As one would imagine, with over 600 scientists involved and continuous measurements split between multiple teams, this project requires a LOT of advance planning. However, this is the fun part, as it means we get to travel a lot in preparation! In March, Jeff and I traveled to Potsdam, Germany to participate in a MOSAiC implementation workshop. Shortly after, we took a train up to the Alfred Wegener Institute facilities in Bremerhaven with Brice, Alessandra, and other MOSAiC participants to train on some of the instrumentation we will be operating on the Polarstern. We spent a full week training on instruments like a gas chromatograph, gas-flux measurement chambers, and a membrane inlet mass spectrometer (MIMS). While many of us had operated these types of instruments before, each machine is different, and several were engineered or re-designed by participating scientists specifically for MOSAiC.

A specially designed gas-flux chamber for measuring metabolic gas fluxes in both snow and ice. Photo courtesy of Brice Loose (URI)
The AWI-engineered MIMS that will be onboard Polarstern. The bubbling chamber ensures precise daily calibrations (and looks really cool).

The bulk of the training was focused on the MIMS, which will be used to take continuous underway ∆O2/Ar measurements from surface waters during MOSAiC. Water is piped from below the Polarstern and run through the mass spectrometer, where dissolved gas concentrations are measured. Argon (Ar), a biologically inert gas, is incorporated into the ocean’s mixed layer at the same rate as oxygen (O2). However, while argon concentrations are evenly distributed, oxygen concentrations are affected by biogeochemical processes (photosynthesis and respiration by biota). We can therefore compare oxygen and argon measurements in the water column to determine how much oxygen deviates from what we would expect from physical air-sea exchange alone (i.e., the deviation due to biological activity). From these oxygen fluxes, we can estimate Net Community Production (NCP), which is defined as the total amount of chemical energy produced by photosynthesis minus that which is used in respiration. This is an important balance to quantify, as it is representative of the amount of carbon removed biologically from the atmosphere (as CO2) and sequestered into the ocean pool. The goal is to use these continuous MOSAiC measurements to quantify these biogeochemical budgets through time and get a better understanding of whether the Arctic is net phototrophic or heterotrophic – whether photosynthesis or respiration is the dominant process.
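To make the calculation concrete, here’s a rough sketch in R with made-up numbers (purely illustrative; the real processing uses calibrated MIMS data and a proper gas exchange parameterization):

## Rough illustration only -- the numbers below are hypothetical, not MOSAiC data.
o2.ar.meas <- 21.5    # O2/Ar ratio measured by the MIMS
o2.ar.sat  <- 20.9    # O2/Ar ratio expected at saturation for the same T and S

## Biological oxygen supersaturation: the fractional deviation of the measured
## ratio from the ratio expected at equilibrium with the atmosphere.
delta.o2.ar <- o2.ar.meas / o2.ar.sat - 1

## NCP is often approximated by scaling this deviation by the gas transfer
## velocity for O2 (k, m per day) and the O2 saturation concentration (mmol per m3).
k      <- 3
o2.sat <- 350
ncp    <- k * o2.sat * delta.o2.ar    # mmol O2 m-2 d-1
ncp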

A behind-the-scenes view of operating the MIMS – photo courtesy of Brice Loose (URI).
Learning how to remove and clean the equilibration tubes. These tubes bubble gases into the water for calibration.
PC: Brice Loose (URI)
We will be partially responsible for operating this instrument during our respective legs, and therefore spent a lot of time thinking about what might possibly go wrong during a year on an ice-locked vessel… and how to fix it. PC: Brice Loose (URI)

Field training in Utqiagvik, Alaska

Utqiagvik, Alaska (formerly Barrow) is located at the northern tip of Alaska, between the Chukchi and Beaufort seas. Nearby Point Barrow is the northernmost point in the United States.

After a productive week in Bremerhaven, this past week we stepped outside the laboratory with a snow and ice field training session in Utqiagvik, Alaska. One of the challenges of Arctic fieldwork is, of course, that it takes place in the frigid Arctic environment. To help scientists prepare for life on the ice and to help standardize and optimize sampling methods for MOSAiC, three snow and ice field training sessions were organized (the other two took place earlier this year in Finland). This trip was particularly exciting for me, as it was my first time in the Arctic! Not only did I learn a lot about sampling sea ice, but I was struck by the dynamic beauty of the polar landscape. No wonder researchers continue to be fascinated with the unanswered questions of this wild ecosystem.

Up close and personal with a large pressure ridge. Pressure ridges in sea ice are formed when two ice floes collide with each other. You can tell that this ridge was formed from multi-year ice by the thickness of the blocks and their deep blue color. Ice is classified as multi-year when it has survived multiple melt seasons.
Post-doc J.P. Balmonte from Uppsala University meanders his way along the pressure ridge.

The three trainings that everyone had to complete consisted of snow sampling, ice sampling, and snowmobile training. Beyond that, people were able to learn or compare more advanced methods within their sampling specialties and test out gear, both scientific instruments and personal weather protection. I was lucky in that the average -18ºC weather we experienced in Utqiagvik will most likely be representative of the type of weather I will be facing in the summer months of MOSAiC. The winter teams will have to contend with considerably colder conditions.

Some days are windier than others and it’s very important to bundle up. On this trip, however, I also learned that layers are key. Working on the ice, especially coring, can be hard work and you don’t want to overheat. Should I need to remove it, beneath my big parka I’ve got on a light puffy jacket, a fleece, and a wool thermal under-layer.
Digging snow pits is an important part of sampling parameters like snow depth and density. The goal is to get a clear vertical transect of snow to examine depth horizons and sample from. If you look closely, you can see where 2 cm thick squares of snow have been removed from the pit’s wall and weighed before being discarded. The wall is built from the snow removed from the working pit and is intended to shield researchers from the wind.
Note the meter-stick for snow thickness.
This is a work view I could get used to.
Coring practice! The extension pole between the corer and drill indicates that this is some pretty thick ice. PC: Jeff Bowman
One of the most exciting trainings we had was on how to operate the snowmobiles. These are a critical form of transport on the ice. They often have sleds attached for transporting gear and samples to and from the ship. As such, we researchers are expected to be able to drive them properly (plus it was pretty fun and allowed us to reach more remote ice locations over our short week in Utqiagvik).
Once out on the ice we practiced tipping the machines over… and how to right them again.
Learning the basics! Note the sled behind ready to be attached to the machine.

While in Utqiagvik, we here at the Bowman Lab decided to make the most of this trip by also collecting some of our own sea-ice cores to sample and experiment with. The goal of our experiment is to determine the best method for melting these cores (necessary for sampling them) while imposing the least amount of stress on the resident microbial communities that we are interested in sampling. I will write up a post covering the methods and ideas behind this experiment soon – but in the meantime, please enjoy this excellent GoPro footage from beneath the ice captured by Jeff during our fieldwork. The brown gunk coating the bottom of the ice is sea-ice algae, mostly made up of diatoms. The ice here is only 68 cm thick, allowing for a lot of light penetration and an abundant photosynthetic community. At the end, you can also spot the elusive scientists in their natural sampling habitat.

What’s next?

Jeff looks to the horizon.

As September 2019 gets closer, preparations are likely to ramp up even more. Even though I won’t be in the field for another year, it is exciting to think that the start of MOSAiC is rapidly approaching, and after these two weeks of training I am feeling much more prepared for the scientific logistics and field challenges that will accompany this research. However, there is still much more to come. In a few weeks I will be jetting off again, but this time to URI to meet up with our collaborators for more instrument training. And thus the preparations continue…

Posted in MOSAiC | Leave a comment

Tutorial: Basic heatmaps and ordination with paprica output

The output from our paprica pipeline for microbial community structure analysis and metabolic inference has changed quite a lot over the last few months. In response to some recent requests, here’s a tutorial that walks through an ordination and a few basic plots with the paprica output. It assumes that you’ve completed this tutorial, which runs paprica on the samples associated with our recent seagrass paper.

For our analysis, let’s bring the pertinent files into R and do some pre-processing:

## read in the edge and unique abundance tables. Note that it's going to take a bit to load the unique_tally file because you have a lot of variables!

tally <- read.csv('2017.07.03_seagrass.bacteria.edge_tally.csv', header = T, row.names = 1)
unique <- read.csv('2017.07.03_seagrass.bacteria.unique_tally.csv', header = T, row.names = 1)

## read in edge_data and taxon_map and seq_edge_map

data <- read.csv('2017.07.03_seagrass.bacteria.edge_data.csv', header = T, row.names = 1)
taxa <- read.csv('2017.07.03_seagrass.bacteria.taxon_map.csv', header = T, row.names = 1, sep = ',', as.is = T)
map <- read.csv('2017.07.03_seagrass.bacteria.seq_edge_map.csv', header = T, row.names = 1)

## convert all na's to 0, then check for low abundance samples

tally[is.na(tally)] <- 0
unique[is.na(unique)] <- 0
rowSums(tally)

## remove any low abundance samples (i.e. bad library builds), and also
## low abundance reads.  This latter step is optional, but I find it useful
## unless you have a particular interest in the rare biosphere.  Note that
## even with subsampling your least abundant reads are noise, so at a minimum
## exclude everything that appears only once.

tally.select <- tally[rowSums(tally) > 5000,]
tally.select <- tally.select[,colSums(tally.select) > 1000]

unique.select <- unique[rowSums(unique) > 5000,]
unique.select <- unique.select[,colSums(unique.select) > 1000]

If your experiment is based on factors (i.e., you want to test for differences between categories of samples) you may want to use DESeq2; otherwise I suggest normalizing by sample abundance.
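For the factor-based case, here’s a minimal, hypothetical DESeq2 sketch. The metadata data frame sample.data and its “site” factor are assumptions for illustration, not part of the tutorial files; DESeq2 also expects integer counts with features as rows, hence the transpose and rounding.

library(DESeq2)

## sample.data: one row per sample, rownames matching rownames(tally.select),
## with a factor column "site" describing the design -- purely illustrative.

dds <- DESeqDataSetFromMatrix(countData = round(t(tally.select)),
                              colData = sample.data,
                              design = ~ site)
dds <- DESeq(dds)
res <- results(dds)
head(res)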

## normalize

tally.select <- tally.select/rowSums(tally.select)
unique.select <- unique.select/rowSums(unique.select)

Now we’re going to do something tricky. For both unique.select and tally.select, rows are observations and columns are variables (edges or unique reads). Those likely don’t mean much to you unless you’re intimately familiar with the reference tree. We can map the edge numbers to taxa using the “taxa” dataframe, but first we need to remove the “X” added by R to make the numbers legal column names. For the unique read labels, we need to split on “_”, which separates the unique read identifier from the edge number.

## get edge numbers associated with columns, and map to taxa names.
## If the entry in taxon is empty it means the read could not be classified below
## the level of the domain Bacteria, and is labeled as "Bacteria"

tally.lab.Row <- taxa[colnames(tally.select), 'taxon']
tally.lab.Row[tally.lab.Row == ""] <- 'Bacteria'

unique.lab.Row <- map[colnames(unique.select), 'global_edge_num']
unique.lab.Row <- taxa[unique.lab.Row, 'taxon']
unique.lab.Row[unique.lab.Row == ""] <- 'Bacteria'
unique.lab.Row[is.na(unique.lab.Row)] <- 'Bacteria'

In the above block of code I labeled the new variables as [tally|unique].lab.Row, because we’ll first use them to label the rows of a heatmap. Heatmaps are a great way to start getting familiar with your data.

## make a heatmap of edge abundance

heat.col <- colorRampPalette(c('white', 'lightgoldenrod1', 'darkgreen'))(100)

heatmap(t(data.matrix(tally.select)),
        scale = 'none',
        col = heat.col,
        labRow = tally.lab.Row,
        margins = c(10, 10))
heatmap(t(data.matrix(unique.select)),
        scale = 'none',
        col = heat.col,
        labRow = unique.lab.Row,
        margins = c(10, 10))

Heatmaps are great for visualizing broad trends in the data, but they aren’t a good entry point for quantitative analysis. A good next step is to carry out some kind of ordination (NMDS, PCoA, PCA, CA). Not all ordination methods will work well for all types of data. Here we’ll use correspondence analysis (CA) on the relative abundance of the unique reads. CA will be carried out with the package “ca”, while “factoextra” will be used to parse the CA output and calculate key additional information. You can find a nice in-depth tutorial on correspondence analysis in R here.

library(ca)
library(factoextra)

unique.select.ca <- ca(unique.select)
unique.select.ca.var <- get_eigenvalue(unique.select.ca)
unique.select.ca.res <- get_ca_col(unique.select.ca)

species.x <- unique.select.ca$colcoord[,1]
species.y <- unique.select.ca$colcoord[,2]

samples.x <- unique.select.ca$rowcoord[,1]
samples.y <- unique.select.ca$rowcoord[,2]

dim.1.var <- round(unique.select.ca.var$variance.percent[1], 1)
dim.2.var <- round(unique.select.ca.var$variance.percent[2], 2)

plot(species.x, species.y,
     ylab = paste0('Dim 2: ', dim.2.var, '%'),
     xlab = paste0('Dim 1: ', dim.1.var, '%'),
     pch = 3,
     col = 'red')

points(samples.x, samples.y,
       pch = 19)

legend('topleft',
       legend = c('Samples', 'Unique reads'),
       pch = c(19, 3),
       col = c('black', 'red'))

At this point you’re ready to crack open the unique.select.ca object and start doing some hypothesis testing. There’s one more visualization, however, that can help with initial interpretation: a heatmap of the top unique reads contributing to the first two dimensions (which account for nearly all of the variance between samples).
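Before that, a quick check on the eigenvalue table we already extracted confirms how much of the between-sample variance the first two dimensions actually capture:

## eigenvalue, variance.percent, and cumulative.variance.percent for each dimension
head(unique.select.ca.var)
sum(unique.select.ca.var$variance.percent[1:2])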

species.contr <- unique.select.ca.res$contrib[,1:2]
species.contr.ordered <- species.contr[order(rowSums(species.contr), decreasing = T),]
species.contr.top <- species.contr.ordered[1:10,]

species.contr.lab <- unique.lab.Row[order(rowSums(abs(species.contr)), decreasing = T)]

heatmap(species.contr.top,
        scale = 'none',
        col = heat.col,
        Colv = NA,
        margins = c(10, 20),
        labRow = species.contr.lab[1:10],
        labCol = c('Dim 1', 'Dim 2'),
        cexCol = 1.5)

From this plot we see that quite a few different taxa are contributing approximately equally to Dim 1 (which accounts for much of the variance between samples), including several different Pelagibacter and Rhodobacteraceae strains. That makes sense as the dominant environmental gradient in the study was inside vs. outside of San Diego Bay and we would expect these strains to be organized along such a gradient. Dim 2 is different with unique reads associated with Tropheryma whipplei and Rhodoluna lacicola contributing most. These aren’t typical marine strains, and if we look back at the original data we see that these taxa are very abundant in just two samples. These samples are the obvious outliers along Dim 2 in the CA plot.

In this tutorial we covered just the community structure output from paprica, but of course the real benefit of using paprica is its estimation of metabolic potential. These data are found in the *.ec_tally.csv and *.path_tally.csv files, and are organized in the same way as the edge and unique read abundance tables. Because of this they can be plotted and analyzed in the same way.
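As a minimal sketch (assuming the pathway tally file shares the file prefix used above; check your own paprica output for the exact name, and adjust the abundance cutoff to your data):

## read in the pathway tally file and treat it like the edge tally above

path.tally <- read.csv('2017.07.03_seagrass.bacteria.path_tally.csv', header = T, row.names = 1)
path.tally[is.na(path.tally)] <- 0

path.select <- path.tally[rowSums(path.tally) > 5000,]   # illustrative cutoff
path.select <- path.select/rowSums(path.select)

heatmap(t(data.matrix(path.select)),
        scale = 'none',
        col = heat.col,
        margins = c(10, 10))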


Posted in paprica | Leave a comment

Tutorial: Nanopore Analysis Pipeline

Introduction

Hi! I’m Sabeel Mansuri, an Undergraduate Research Assistant for the Bowman Lab at the Scripps Institution of Oceanography, University of California San Diego. The following is a tutorial that demonstrates a pipeline used to assemble and annotate a bacterial genome from Oxford Nanopore MinION data.

This tutorial will require the following (brief installation instructions are included below):

  1. Canu Assembler
  2. Bandage
  3. Prokka
  4. Barrnap
  5. DNAPlotter (alternatively circos)

Software Installation

Canu

Canu is a packaged correction, trimming, and assembly program that is forked from the Celera assembler codebase. Install the latest version by building it from source:

git clone https://github.com/marbl/canu.git
cd canu/src
make

Bandage

Bandage is an assembly visualization software. Install it by visiting this link, and downloading the version appropriate for your device.

Prokka

Prokka is a gene annotation program. Install it by visiting this link, and running the installation commands appropriate for your device.

Barrnap

Barrnap is an rRNA prediction software used by Prokka. Install it by visiting this link, and running the installation commands appropriate for your device.

DNAPlotter

DNAPlotter is a gene annotation visualization software. Install it by visiting this link, and running the installation commands appropriate for your device.

Dataset

Download the nanopore dataset located here. This is an isolate from a sample taken from a local saline lake at South Bay Salt Works near San Diego, California.

The download will provide a tarball. Extract it:

tar -xvf nanopore.tar.gz

This will create a runs_fastq folder containing 8 fastq files containing genetic data.

Assembly

Canu can be used directly on the data without any preprocessing. The only additional information needed is an estimate of the genome size of the sample. For the saline isolate, we estimate 3,000,000 base pairs. Then use the following Canu command to assemble the data:

canu -p test_canu -d test_canu genomeSize=3000000 gnuplotTested=true -nanopore-raw runs_fastq/*.fastq

A quick description of all flags and parameters:

  • -nanopore-raw – specifies that the reads are raw Oxford Nanopore data with no preprocessing; the read files follow this flag
  • -p – specifies prefix for output files, use “test_canu” as default
  • -d – specifies directory to run test and output files in, use “test_canu” as default
  • genomeSize – estimated genome size of isolate
  • gnuplotTested – setting to true will skip gnuplot testing; gnuplot is not needed for this pipeline

Running this command will output various files into the test_canu directory. The assembled contigs are located in the test_canu.contigs.fasta file. These contigs can be better visualized using Bandage.

Assembly Visualization

Open Bandage and a GUI window should pop up. In the toolbar, click File > Load Graph, and select test_canu/test_canu.contigs.gfa. You should see something like the following:

This graph reveals that one of our contigs appears to be a whole circular chromosome! A quick comparison with the test_canu.contigs.fasta file reveals this is Contig 1. We extract only this sequence from the contigs file to examine further. Note that the first contig takes up the first 38,673 lines of the file, so use head:

head -n38673 test_canu/test_canu.contigs.fasta >> test_canu/contig1.fasta 

NCBI BLAST

We BLAST this contig against NCBI’s nucleotide BLAST database (linked here) with all default options. The top hit is:

Hit: Halomonas sp. hl-4 genome assembly, chromosome: I  
Organism: Halomonas sp. hl-4  
Phylogeny: Bacteria/Proteobacteria/Gammaproteobacteria/Oceanospirillales/Halomonadaceae/Halomonas  
Max score: 65370  
Query cover: 72%  
E value: 0.0  
Ident: 87%

It appears this chromosome is the genome of an organism in the genus Halomonas. We may now be interested in the gene annotation of this genome.

Gene Annotation

Prokka will take care of gene annotation; the only required input is the contig1.fasta file.

prokka --outdir circular --prefix test_prokka test_canu/contig1.fasta

The newly created circular directory contains various files with data on the gene annotation. Take a look inside test_prokka.txt for a summary of the annotation. We can take a quick look at the annotation using the DNAPlotter GUI. For a more customized circular plot, use Circos.

Summary

The analysis above has taken Oxford Nanopore sequenced data, assembled contigs, identified the closest matching organism, and annotated its genome.

Posted in Computer tutorials | 3 Comments

Weighted Gene Correlation Network Analysis (WGCNA) Applied to Microbial Communities

Weighted gene correlation network analysis (WGCNA) is a powerful network analysis tool that can be used to identify groups of highly correlated genes that co-occur across your samples. Genes are thus sorted into modules, and these modules can then be correlated with other traits (which must be continuous variables).

Originally created to assess gene expression data in human patients, the authors of the WGCNA method and R package have a thorough tutorial with in-depth explanations (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/). More recently, the method has been applied to microbial communities (Duran-Pinedo et al., 2011; Aylward et al., 2015; Guidi et al., 2016; Wilson et al., 2018). The following is a walk-through using microbial sequence abundances and environmental data from my 2018 work (https://www.ncbi.nlm.nih.gov/m/pubmed/29488352/).

Background: WGCNA finds how clusters of genes (or in our case abundances of operational taxonomic units–OTUs) correlate with traits (or in our case environmental variables or biochemical rates) using hierarchical clustering, novel applications of weighted adjacency functions and topological overlap measures, and a dynamic tree cutting method.

Very simply, each OTU is going to be represented by a node in a vast network and the adjacency (a score between 0 and 1) between each set of nodes will be calculated. Many networks use hard-thresholding (where a connection score [e.g. a Pearson Correlation Coefficient] between any two nodes is noted as 1 if it is above a certain threshold and noted as 0 if it is below it). This ignores the actual strength of the connection so WGCNA constructs a weighted gene (or OTU) co-occurrence adjacency matrix in lieu of ‘hard’ thresholding. Because our original matrix has abundance data the abundance of each OTU is also factored in.

For this method to work you also have to select a soft thresholding power (sft) to which each co-expression similarity is raised in order to make these scores “connection strengths”. I used a signed adjacency function:

  • Adjacency = (0.5*(1 + Pearson correlation))^sft

because it preserves the sign of the connection (whether nodes are positively or negatively correlated), which is the recommendation of the WGCNA authors.
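To make the formula concrete, here is a tiny, purely illustrative example of what the signed adjacency function does to a handful of Pearson correlations (the adjacency() call further down applies the same transformation to the full OTU correlation matrix):

soft.power <- 10                    # soft-thresholding power (chosen below)
r   <- c(-0.9, -0.2, 0, 0.4, 0.9)   # example Pearson correlations
adj <- ((1 + r) / 2)^soft.power     # signed adjacency, bounded between 0 and 1
round(adj, 4)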

You pick your soft-thresholding value by requiring approximate scale-free topology. This is based on the idea that the probability that a node is connected with k other nodes decays as a power law:

  • p(k)~ k^(-γ)

This idea is linked to the growth of networks–new nodes are more likely to be attached to already established nodes. In general, scale-free networks display a high degree of tolerance against errors (Zhang & Horvath, 2005).

You then turn your adjacency matrix into a Topological Overlap Measure (TOM) to minimize the effects of noise and spurious associations. A topological overlap of two nodes also factors in all of their shared neighbors (their relative interrelatedness)–so you are basically taking a simple co-occurrence between two nodes and placing it in the framework of the entire network by factoring in all the other nodes each is connected to. For more information regarding adjacency matrices and TOMs please see Zhang & Horvath (2005) and Langfelder & Horvath (2007 & 2008).
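For intuition, here is the classic (unsigned) topological overlap formula of Zhang & Horvath (2005) written out for a small toy adjacency matrix; TOMsimilarity(), used later in this post, performs this calculation (with refinements for signed networks) on the full matrix:

## toy adjacency matrix with a zero diagonal -- purely illustrative
a <- matrix(c(0.0, 0.8, 0.1,
              0.8, 0.0, 0.3,
              0.1, 0.3, 0.0), nrow = 3)

l <- a %*% a                                   # shared-neighbor term
k <- rowSums(a)                                # connectivity of each node
tom <- (l + a) / (outer(k, k, pmin) + 1 - a)   # topological overlap
diag(tom) <- 1
tom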

Start: Obtain an OTU abundance matrix (MB.0.03.subsample.fn.txt) and environmental data (OxygenMatrixMonterey.csv).

The OTU abundance matrix simply has all the different OTUs that were observed in a bunch of different samples (denoted in the Group column; e.g. M1401, M1402, etc.). These OTUs represent 16S rRNA gene sequences that were amplified with the universal primers 515F-Y (5′-GTGYCAGCMGCCGCGGTAA) and 926R (5′-CCGYCAATTYMTTTRAGTTT) and were clustered at a 97% similarity cutoff. Libraries were previously subsampled to the smallest library size, and all processing took place in mothur (https://www.mothur.org/). See Wilson et al. (2018) for more details.

The environmental data matrix tells you a little bit more about the different Samples, like the Date of collection, which of two site Locations it was collected from, and the Depth or Zone of collection. You also see a bunch of different environmental variables like several different Upwelling indices (for different stations and different time spans), community respiration rate (CR), Oxygen Concentration, and Temperature. Again, see Wilson et al. (2018) for more details.

Code–Initial Stuff:

Read data in:

data<-read.table("MB.0.03.subsample.fn.txt",header=T,na.strings="NA")

For this particular file we have to get rid of the first three columns, since the OTUs don’t actually start until the 4th column:

data1 = data[-1][-1][-1]

You should turn your raw abundance values into a relative abundance matrix and potentially transform it. I recommend a Hellinger transformation (the square root of the relative abundance)–this effectively gives low weight to variables with low counts and many zeros. If you wanted, you could do the logarithmic transformation of Anderson et al. (2006) here instead.

library("vegan", lib.loc="~/R/win-library/3.3")
HellingerData<-decostand(data1,method = "hellinger")

You have to limit the OTUs to the most frequent ones (ones that occur in multiple samples so that you can measure co-occurrence across samples). I just looked at my data file and looked for where zeros became extremely common. This was easy because mothur sorts OTUs according to abundance. If you would like a more objective way of selecting the OTUs, or if your OTUs are not sorted, then this code may help:

lessdata <- data1[,colSums(data1) > 0.05]

(though you will have to decide what cutoff works best for your data).

Code–Making your relative abundance matrix:

You have to reattach the Group Name column:

RelAbun1 = data.frame(data[2],HellingerData[1:750])

Write file (this step isn’t absolutely necessary, but you may want this file later at some point):

write.table(RelAbun1, file = "MontereyRelAbun.txt", sep="\t")

Code–Visualizing your data at the sample level:

Now load the WGCNA package:

library("WGCNA", lib.loc="~/R/win-library/3.3")

Bring data in:

OTUs<-read.table("MontereyRelAbun.txt",header=T,sep="\t")
dim(OTUs);
names(OTUs);

Turn the first column (sample name) into row names (so that only OTUs make up actual columns):

datExpr0 = as.data.frame((OTUs[,-c(1)]));
names(datExpr0) = names(OTUs)[-c(1)];
rownames(datExpr0) = OTUs$Group;

Check Data for excessive missingness:

gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK

You should get TRUE for this dataset given the parameters above. TRUE means that all OTUs have passed the cut. This means that when you limited your OTUs to the most common ones above, you didn’t leave any in with too many zeros. It is still possible that you were too choosy, though. If you got FALSE for your data, then you have to drop the flagged OTUs and samples (a minimal sketch follows).
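For reference, here is a minimal sketch of the standard clean-up for the FALSE case, adapted from the WGCNA tutorials (it simply drops the flagged OTUs and samples):

if (!gsg$allOK)
{
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}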

Cluster the samples to see if there are any obvious outliers:

sampleTree = hclust(dist(datExpr0), method = "average");

sizeGrWindow(12,9)

par(cex = 0.6);
par(mar = c(0,4,2,0))

plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)

The sample dendrogram doesn’t show any obvious outliers, so I didn’t remove any samples. If you do need to remove some samples, the standard approach is sketched below.
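Had the dendrogram shown an outlier, a minimal sketch of the removal step (adapted from the WGCNA tutorials) looks like this; the cut height of 15 is a placeholder you would choose from your own plot:

abline(h = 15, col = "red")
clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
datExpr0 = datExpr0[clust == 1, ]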

Now read in trait (Environmental) data and match with expression samples:

traitData = read.csv("OxygenMatrixMonterey.csv");
dim(traitData)
names(traitData)

Form a data frame analogous to expression data (relative abundances of OTUs) that will hold the Environmental traits:

OTUSamples = rownames(datExpr0);
traitRows = match(OTUSamples, traitData$Sample);
datTraits = traitData[traitRows, -1];
rownames(datTraits) = traitData[traitRows, 1];
collectGarbage()

Outcome: Now your OTU expression (or abundance) data are stored in the variable datExpr0 and the corresponding environmental traits are in the variable datTraits. Now you can visualize how the environmental traits relate to clustered samples.

Re-cluster samples:

sampleTree2 = hclust(dist(datExpr0), method = "average")

Convert traits to a color representation: white means low, red means high, grey means missing entry:

traitColors = numbers2colors(datTraits[5:13], signed = FALSE);

Plot the sample dendrogram and the colors underneath:

plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits[5:13]),
main = "Sample dendrogram and trait heatmap")

Again: white means a low value, red means a high value, and gray means missing entry. This is just initial stuff… we haven’t looked at modules of OTUs that occur across samples yet.

Save:

save(datExpr0, datTraits, file = "Monterey-dataInput.RData")

Code–Network Analysis:

Allow multi-threading within WGCNA. This helps speed up certain calculations.
Any error here may be ignored but you may want to update WGCNA if you see one.

options(stringsAsFactors = FALSE);
enableWGCNAThreads()

Load the data saved in the first part:

lnames = load(file = "Monterey-dataInput.RData");

The variable lnames contains the names of loaded variables:

lnames

Note: You have a couple of options for how you create your weighted OTU co-expression network. I went with the step-by-step construction and module detection. Please see this document for information on the other methods (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/Simulated-05-NetworkConstruction.pdf).

Choose a set of soft-thresholding powers:

powers = c(c(1:10), seq(from = 11, to=30, by=1))

Call the network topology analysis function:
Note: I am using a signed network because it preserves the sign of the connection (whether nodes are positively or negatively correlated); this is the recommendation of the WGCNA authors.

sft = pickSoftThreshold(datExpr0, powerVector = powers, verbose = 5, networkType = "signed")

Output:

pickSoftThreshold: will use block size 750.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 750 of 750
   Power SFT.R.sq slope truncated.R.sq  mean.k. median.k. max.k.
1      1   0.0299  1.47          0.852 399.0000  400.0000 464.00
2      2   0.1300 -1.74          0.915 221.0000  221.0000 305.00
3      3   0.3480 -2.34          0.931 128.0000  125.0000 210.00
4      4   0.4640 -2.41          0.949  76.3000   73.1000 150.00
5      5   0.5990 -2.57          0.966  47.2000   44.0000 111.00
6      6   0.7010 -2.52          0.976  30.1000   27.1000  83.40
7      7   0.7660 -2.47          0.992  19.8000   17.2000  64.30
8      8   0.8130 -2.42          0.986  13.3000   11.0000  50.30
9      9   0.8390 -2.34          0.991   9.2200    7.1900  40.00
10    10   0.8610 -2.24          0.992   6.5200    4.8800  32.20
11    11   0.8670 -2.19          0.987   4.7000    3.3700  26.20
12    12   0.8550 -2.18          0.959   3.4600    2.3300  21.50

This is showing you the power (soft thresholding value), the r2 for the scale independence for each particular power (we shoot for an r2 higher than 0.8), the mean number of connections each node has at each power (mean.k), the median number of connections/node (median.k), and the maximum number of connections (max.k).

Plot the results:

sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;

Scale-free topology fit index (r2) as a function of the soft-thresholding power:

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");

This line corresponds to using an R^2 cut-off of h:

abline(h=0.8,col="red")

Mean connectivity as a function of the soft-thresholding power:

plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

I picked a soft thresholding value of 10 because it was well above an r2 of 0.8 (it is a local peak for the r2) and the mean connectivity is still reasonable (mean.k ≈ 6.5).

So now we just calculate the adjacencies, using the soft thresholding power of 10:

softPower = 10;
adjacency = adjacency(datExpr0, power = softPower, type = "signed");

Then we transform the adjacency matrix into a Topological Overlap Matrix (TOM) and calculate corresponding dissimilarity:

Remember: The TOM you calculate shows the topological similarity of nodes, factoring in the connection strength two nodes share with other “third party” nodes. This will minimize effects of noise and spurious associations:

TOM = TOMsimilarity(adjacency, TOMType = "signed");
dissTOM = 1-TOM

Create a dendrogram by calling the hierarchical clustering function on the TOM-based dissimilarity:

TaxaTree = hclust(as.dist(dissTOM), method = "average");

Plot the resulting clustering tree (dendrogram):

sizeGrWindow(12,9)
plot(TaxaTree, xlab="", sub="", main = "Taxa clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04);

This image is showing us the clustering of all 750 OTUs based on the TOM dissimilarity index.

Now you have to decide the optimal module size for your system, and you should play around with this value a little. I wanted relatively large modules, so I set the minimum module size relatively high at 30:

minModuleSize = 30;

Module identification using dynamic tree cut (there are a couple of different ways to figure out your modules, so you should explore what works best for you in the tutorials by the authors):

dynamicMods = cutreeDynamic(dendro = TaxaTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize);
table(dynamicMods)

Convert numeric labels into colors:

dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
dynamicColors
    black      blue     brown     green       red turquoise    yellow 
       49       135       113        71        64       216       102

You can see that there are a total of 7 modules (you should have seen that above too) and that each module is now named a different color. The numbers under the colors tell you how many OTUs were sorted into that module. Each OTU is in exactly one module, and you can see that if you add up all of the numbers from the various modules you get 750 (the number of OTUs that we limited our analysis to above).

Plot the dendrogram with module colors underneath:

sizeGrWindow(8,6)
plotDendroAndColors(TaxaTree, dynamicColors, "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Taxa dendrogram and module colors")

Now we will quantify the co-expression similarity of entire modules using eigengenes and cluster them based on their correlation:
Note: An eigengene is the 1st principal component of a module expression matrix and represents a suitably defined average OTU community.

Calculate eigengenes:

MEList = moduleEigengenes(datExpr0, colors = dynamicColors)
MEs = MEList$eigengenes

Calculate dissimilarity of module eigengenes:

MEDiss = 1-cor(MEs);

Cluster module eigengenes:

METree = hclust(as.dist(MEDiss), method = "average");

Plot the result:

sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")

Now we will see if any of the modules should be merged. I chose a height cut of 0.30, corresponding to a similarity of 0.70 to merge:

MEDissThres = 0.30

Plot the cut line into the dendrogram:

abline(h=MEDissThres, col = "red")

You can see that, according to our cutoff, none of the modules should be merged.

If there were some modules that needed to be merged you can call an automatic merging function:

merge = mergeCloseModules(datExpr0, dynamicColors, cutHeight = MEDissThres, verbose = 3)

The merged module colors:

mergedColors = merge$colors;

Eigengenes of the new merged modules:

mergedMEs = merge$newMEs;

If you had combined different modules then that would show in this plot:

sizeGrWindow(12, 9)

plotDendroAndColors(TaxaTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)

If we had merged some of the modules that would show up in the Merged dynamic color scheme.

Rename the mergedColors to moduleColors:

moduleColors = mergedColors

Construct numerical labels corresponding to the colors:

colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)-1;
MEs = mergedMEs;

Save module colors and labels for use in subsequent parts:

save(MEs, moduleLabels, moduleColors, TaxaTree, file = "Monterey-networkConstruction-stepByStep.RData")

Code–Relating modules to external information and IDing important taxa:

Here you are going to identify modules that are significantly associated with environmental traits/biogeochemical rates. You already have summary profiles for each module (eigengenes–remember that an eigengene is the 1st principal component of a module expression matrix and represents a suitably defined average OTU community), so we just have to correlate these eigengenes with environmental traits and look for significant associations.

Defining numbers of OTUs and samples:

nTaxa = ncol(datExpr0);
nSamples = nrow(datExpr0);

Recalculate MEs (module eigengenes):

MEs0 = moduleEigengenes(datExpr0, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

Now we will visualize it:

sizeGrWindow(10,6)

textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));

labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))

Each row corresponds to a module eigengene and each column corresponds to an environmental trait or biogeochemical rate (as long as it is continuous–notice that the categorical variables are gray and say NA). Each cell contains the corresponding Pearson correlation coefficient (top number) and a p-value (in parentheses). The table is color-coded by correlation according to the color legend.

You can see that the Brown module is positively correlated with many indices of upwelling while the Black module is negatively correlated with many indices of upwelling. For this work I was particularly interested in CR, and so I focused on modules that correlated positively or negatively with CR. The Red module was negatively associated with CR while the Blue module was positively associated with CR.

Let’s look more at the Red module by quantifying the associations of individual taxa with CR:

First define the variable we are interested in from datTrait:

CR = as.data.frame(datTraits$CR);
names(CR) = "CR"

modNames = substring(names(MEs), 3)
TaxaModuleMembership = as.data.frame(cor(datExpr0, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaModuleMembership), nSamples));
names(TaxaModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
TaxaTraitSignificance = as.data.frame(cor(datExpr0, CR, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(TaxaTraitSignificance), nSamples));
names(TaxaTraitSignificance) = paste("GS.", names(CR), sep="");
names(GSPvalue) = paste("p.GS.", names(CR), sep="");

module = "red"
column = match(module, modNames);
moduleTaxa = moduleColors==module;
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(TaxaModuleMembership[moduleTaxa, column]),
abs(TaxaTraitSignificance[moduleTaxa, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Taxa significance for CR",
main = paste("Module membership vs. Taxa significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

This graph shows you, for each taxon (each red dot is an OTU that belongs to the Red module), 1) how it correlates with the environmental trait of interest and 2) how important it is to the module. The taxa/OTUs that have high module membership tend to occur whenever the module is represented in the environment and are therefore often connected throughout the samples with other red taxa/OTUs. In this module, these hubs (Red OTUs that occur with other Red OTUs) are also the most important OTUs for predicting CR.
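If you want to pull those hub OTUs out explicitly, a quick (hypothetical) follow-up is to rank the Red-module OTUs by their module membership alongside their significance for CR:

red.hubs = data.frame(OTU = names(datExpr0)[moduleTaxa],
MM = TaxaModuleMembership[moduleTaxa, column],
GS = TaxaTraitSignificance[moduleTaxa, 1])
red.hubs = red.hubs[order(-abs(red.hubs$MM)), ]
head(red.hubs, 10)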

Now let’s get more info about the taxa that make up the Red module:

First, merge the statistical info from the previous section (modules with high association with the trait of interest–e.g. CR or Temp) with taxa annotation and write a file that summarizes these results:

names(datExpr0)
names(datExpr0)[moduleColors=="red"]

You will have to feed in an annotation file–a file listing what Bacteria/Archaea go with each OTU (I am not providing you with this file, but it just had a column with OTUs and a column with the Taxonomy).

annot = read.table("MB.subsample.fn.0.03.cons.taxonomy",header=T,sep="\t");
dim(annot)
names(annot)
probes = names(datExpr0)
probes2annot = match(probes, annot$OTU)

Check for the number of probes without annotation (it should return a 0):

sum(is.na(probes2annot))

Create the starting data frame:

TaxaInfo0 = data.frame(Taxon = probes,
TaxaSymbol = annot$OTU[probes2annot],
LinkID = annot$Taxonomy[probes2annot],
moduleColor = moduleColors,
TaxaTraitSignificance,
GSPvalue)

Order modules by their significance for CR:

modOrder = order(-abs(cor(MEs, CR, use = "p")));

Add module membership information in the chosen order:

for (mod in 1:ncol(TaxaModuleMembership))
{
oldNames = names(TaxaInfo0)
TaxaInfo0 = data.frame(TaxaInfo0, TaxaModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(TaxaInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}

Order the OTUs in the TaxaInfo0 variable first by module color, then by taxa–trait significance (GS.CR):

TaxaOrder = order(TaxaInfo0$moduleColor, -abs(TaxaInfo0$GS.CR));
TaxaInfo = TaxaInfo0[TaxaOrder, ]

Write file:

write.csv(TaxaInfo, file = "TaxaInfo.csv")

Here is a bit of the output file I got:

Taxon TaxaSymbol LinkID moduleColor GS.TotalRate p.GS.TotalRate MM.red p.MM.red MM.blue p.MM.blue MM.green p.MM.green MM.brown p.MM.brown MM.turquoise p.MM.turquoise MM.black p.MM.black MM.yellow p.MM.yellow
Otu00711 Otu00711 Otu00711 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);Surface_4(100); black 0.461005 0.00111 0.005028 0.973244 0.243888 0.098526 -0.07719 0.606075 -0.25274 0.086535 0.058878 0.694233 0.502027 0.000324 0.132412 0.374947
Otu00091 Otu00091 Otu00091 Bacteria(100);Bacteroidetes(100);Flavobacteriia(100);Flavobacteriales(100);Flavobacteriaceae(100);Formosa(100); black 0.378126 0.008778 -0.17243 0.246462 0.446049 0.001676 0.34467 0.017667 -0.55057 6.08E-05 0.492517 0.000437 0.615168 4.20E-06 0.367211 0.011115
Otu00082 Otu00082 Otu00082 Bacteria(100);Bacteroidetes(100);Flavobacteriia(100);Flavobacteriales(100);Flavobacteriaceae(100);NS5_marine_group(100); black -0.35649 0.013911 0.222515 0.132755 -0.06428 0.667734 0.175654 0.237601 -0.45502 0.001312 0.421756 0.003151 0.750195 1.28E-09 0.126362 0.397349
Otu00341 Otu00341 Otu00341 Bacteria(100);Bacteroidetes(100);Cytophagia(100);Cytophagales(100);Flammeovirgaceae(100);Marinoscillum(100); black -0.28242 0.054435 0.023927 0.873162 -0.07555 0.613762 0.144688 0.331879 -0.03144 0.833838 0.035147 0.814565 0.209255 0.158058 -0.06565 0.661083
Otu00537 Otu00537 Otu00537 Bacteria(100);Verrucomicrobia(100);Verrucomicrobiae(100);Verrucomicrobiales(100);Verrucomicrobiaceae(100);Persicirhabdus(100); black -0.23668 0.109211 0.123673 0.40755 -0.17362 0.243171 -0.05738 0.701628 -0.26399 0.072961 0.264411 0.072493 0.425082 0.002897 0.040045 0.789278
Otu00262 Otu00262 Otu00262 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);Surface_1(100);Candidatus_Pelagibacter(90); black -0.23615 0.110023 0.327396 0.02468 -0.22748 0.12411 -0.13779 0.355699 -0.23708 0.108594 0.271968 0.064409 0.554592 5.23E-05 0.036113 0.809563
Otu00293 Otu00293 Otu00293 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);SAR11_clade(100);SAR11_clade_unclassified(100); black 0.223427 0.131133 0.142016 0.34098 0.209327 0.157912 0.234713 0.112274 -0.53032 0.000126 0.529907 0.000128 0.627937 2.30E-06 0.390714 0.006621
Otu00524 Otu00524 Otu00524 Bacteria(100);Actinobacteria(100);Acidimicrobiia(100);Acidimicrobiales(100);OM1_clade(100);Candidatus_Actinomarina(100); black -0.20559 0.165629 0.28312 0.053809 0.016758 0.910982 0.148756 0.318316 -0.34758 0.016669 0.376043 0.009188 0.494903 0.000406 0.377597 0.00888
Otu00370 Otu00370 Otu00370 Bacteria(100);Verrucomicrobia(100);Opitutae(100);MB11C04_marine_group(100);MB11C04_marine_group_unclassified(100); black -0.20397 0.169074 0.303984 0.037771 -0.06655 0.656707 0.009401 0.949991 -0.25451 0.084275 0.097595 0.514 0.642409 1.13E-06 0.224711 0.128875

 

NOTES on output:

moduleColor is the module that the OTU was ultimately put into

GS stands for Gene Significance (for us it means taxon significance) while MM stands for module membership.

GS.Environmentaltrait = Pearson Correlation Coefficient for that OTU with the trait. GS allows incorporation of external info into the co-expression network by showing gene/OTU significance. The higher the absolute value of GS the more biologically significant the gene (or in our case taxa) to that external variable (e.g. CR).
p.GS.Environmentaltrait = P value for the preceding relationship.

MM.color = Pearson Correlation Coefficient for Module Membership–i.e. how well that OTU correlates with that particular color module (each OTU has a value for each module but only belongs to one module). If close to 0 or negative then the taxa is not part of that color module (since each OTU has to be put in a module you may get some OTUs that are close to 0, but they aren’t important to that module). If it is close to 1 then it is highly connected to that color module, but will be placed with the color module that it is most connected to throughout the samples.
p.MM.color = P value for the preceding relationship.

Modules will be ordered by their significance for the external variable you selected (e.g. CR), with the most significant ones to the left.
Each of the modules (with each OTU assigned to exactly one module) will be represented for the environmental trait you selected.
You will have to rerun this for each environmental trait you are interested in.
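As a small, hypothetical example of working with the exported table, here is how you might pull out the OTUs assigned to the Red module and sort them by their significance for CR (with the code above the column is GS.CR; in my example output it appears as GS.TotalRate because of how the trait was named in that run):

taxa.info = read.csv("TaxaInfo.csv", row.names = 1)
red.info = taxa.info[taxa.info$moduleColor == "red", ]
red.info = red.info[order(-abs(red.info$GS.CR)), ]
head(red.info)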

 

Posted in Computer tutorials | 3 Comments

Postdoctoral researcher sought in bioinformatics/predictive analytics

We have an opening for a postdoctoral scholar in the area of bioinformatics and predictive analytics.  The ideal candidate will have demonstrated expertise in one of these two areas and a strong interest in the other.  Possible areas of expertise within predictive analytics include the development and application of neural networks and other machine learning techniques, or nonlinear statistical modeling techniques such as empirical dynamical modeling.  Possible areas of expertise within bioinformatics include genome assembly and annotation, metagenomic or metatranscriptomic analysis, or advanced community structure analysis.  Candidates should be proficient with R, Python, or Matlab, familiar with scientific computing in Linux, and fluent in English.

The successful candidate will take the lead in an exciting new industry collaboration to predict process thresholds from changes in microbial community structure.  The goal of this collaboration is not only to predict thresholds, but to fully understand the underlying ecosystem dynamics through genomics and metabolic modeling.  The candidate will be in residence at Scripps Institution of Oceanography at UC San Diego, but will have the opportunity to work directly with industry scientists in the San Diego area.  The candidate is expected to take on a leadership role in the Bowman Lab, participate in training graduate students and undergraduates, represent the lab at national and international meetings, and publish their work in scholarly journals.  The initial opportunity is for two years with an option for a third year contingent on progress made.  The position starts January of 2019.

Interested candidates should send a CV and a brief, 1-page statement of interest to Jeff Bowman at jsbowman at ucsd.edu.  The position will remain open until filled, but interested candidates are encouraged to apply by 11/16 for first consideration.

Posted in Job opportunities | Leave a comment

Separating bacterial and archaeal reads for analysis with paprica

One of the most popular primer sets for 16S rRNA gene amplicon analysis right now is the 515F/806R set. One of the advantages of this pair is that it amplifies broadly across the domains Archaea and Bacteria. This reduces by half the amount of work required to characterize prokaryotic community structure, and allows a comparison of the relative (or absolute, if you have counts) abundance of bacteria and archaea.  However, paprica and many other analysis tools aren’t designed to simultaneously analyze reads from both domains.  Different reference alignments or covariance models, for example, might be required.  Thus it’s useful to split an input fasta file into separate bacterial and archaeal files.

We like to use the Infernal tool cmscan for this purpose.  First, you'll need to acquire covariance models for the 16S/18S rRNA genes from all three domains of life.  You can find those on the Rfam website; they are also included in paprica/models if you've downloaded paprica.  Copy the models to a new subdirectory in your working directory, combining them into a single file:

mkdir cms
cat ../paprica/models/*cm > cms/all_domains.cm
cd cms

Now you need to compress and index the covariance models using the cmpress utility provided by Infernal.  This takes a while.

cmpress all_domains.cm

Pretty simple.  Now you're ready to do some work.  The whole Infernal suite of tools has pretty awesome built-in parallelization, but with only three covariance models in the file you won't get much out of it.  Best to minimize cmscan's use of cores and instead push lots of files through it at once.  This is easily done with GNU Parallel, run from your working directory (one level up from the cms directory):

ls *.fasta | parallel -u cmscan --cpu 1 --tblout {}.txt cms/all_domains.cm {} > /dev/null

Next comes the secret sauce.  The command above produces an easy-to-parse, easy-to-read table with classification stats for each of the covariance models that we searched against.  Paprica contains a utility in paprica/utilities/pick_16S_domain.py to parse the table, figure out which model scored best for each read, and then write three new fasta files, one for each of the domains Bacteria, Archaea, and Eukarya (the primers will typically pick up a few euks).  We'll parallelize the script just as we did for cmscan.

ls *.fasta | parallel -u python pick_16S_domain_2.py -prefix {} -out {}

Now you have domain-specific files that you can analyze in paprica or your amplicon analysis tool of choice!
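If you're curious what "picking the best model for each read" amounts to, here is a rough sketch of the idea in R. This is only an illustration, not a substitute for the paprica utility; the file name is one of the tblout files produced above, and the column positions assume Infernal 1.1's default --tblout layout.

## read one cmscan tblout file, dropping the "#" comment lines
lines <- readLines("sample.fasta.txt")
lines <- lines[!grepl("^#", lines)]
fields <- strsplit(lines, "\\s+")

## model name is field 1, read (query) name is field 3, bit score is field 15
hits <- data.frame(model = sapply(fields, "[", 1),
                   read  = sapply(fields, "[", 3),
                   score = as.numeric(sapply(fields, "[", 15)),
                   stringsAsFactors = FALSE)

## keep the highest-scoring model for each read
best <- do.call(rbind, lapply(split(hits, hits$read),
                              function(x) x[which.max(x$score), ]))
table(best$model)   # number of reads assigned to each domain's model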

 

 

Posted in paprica | 2 Comments

Tutorial: How to make a map using QGIS

Hi! I’m Natalia Erazo, currently working on the Ecuador project aimed at examining biogeochemical processes in mangrove forests. In this tutorial, we’ll learn the basics of (free) QGIS: how to import vector data and make a map using data obtained from our recent field trip to the Ecological Reserve Cayapas Mataje in Ecuador! We’ll also cover standard map elements and use the QGIS Print Composer to generate a final map.

Objectives:

I. Install QGIS

II. Learn how to load base map (raster) layers using the OpenLayers and QuickMapServices plugins.

III. Learn how to import vector data: latitude and longitude points plus additional variables. Learn how to select attributes from the data, e.g. salinity values, and plot them.

IV. Make a map using Print Composer in QGIS.

I. QGIS Installation

QGIS is a powerful, user-friendly, open-source geographic information system that runs on Linux, Unix, macOS, and Windows. QGIS can be downloaded here. You should follow the instructions and install the gdal complete.pkg, numpy.pkg, matplotlib.pkg, and qgis.pkg packages.

II. Install QGIS plugins and add a base map.

1. Install the QGIS plugins

Go to Plugins and select Manage and Install Plugins. This will open the plugins dialogue box; type OpenLayers Plugin and click Install plugin.

This plugin gives you access to Google Maps, OpenStreetMap layers, and others, and it is very useful for making quick maps from Google satellite, physical, and street layers. However, the OpenLayers plugin can generate zoom errors in your maps. There is another plugin, QuickMapServices, which uses tile servers rather than the direct API for getting Google layers and others. It is a very useful plugin that offers more options for base maps and fewer zoom errors. To install it, follow the same steps as for the OpenLayers plugin, except this time type QuickMapServices and install the plugin.

Also, if you want to experiment with QuickMapServices you can expand the plugin: go to Web -> QuickMapServices -> Settings -> More services and click on Get contributed pack. This will give you more options for base maps.

2. Add the base map layer:

I recommend playing with the various options in either OpenLayers (the Google satellite, physical, and other map layers) or QuickMapServices.

For this map, we will use the ESRI layers from QuickMapServices. Go to Web -> QuickMapServices -> Esri -> ESRI Satellite.

You should see your satellite map.

You can click on the zoom-in icon to adjust the zoom, as shown in the map below where I zoom in on the Galapagos Islands. You’ll also notice the Layers panel box on the left side; this box shows all the layers you add to your map. Layers can be raster or vector data; in this case we see one layer, ESRI Satellite. At the far left you’ll see a list of icons that are used to import your layers. It is important to know what kind of data you are importing into QGIS so that you use the correct function.

III. Adding our vector data.

We will now add our data file, which contains the latitude and longitude of all the sites where we collected samples, in addition to values for salinity, temperature, and turbidity. You can do this with your own data by creating a file in Excel with columns for longitude and latitude plus columns for your other variables, and saving it as a csv file (a minimal example of the layout is shown below). To import the data, go to the icons on the far left and click on “Add Delimited Text Layer”, or click Layer -> Add Layer -> Add Delimited Text Layer.
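For reference, a csv along these lines might look like the following (the column names and values here are made up for illustration):

latitude,longitude,salinity_psu,temperature_c,turbidity_ntu
1.2034,-78.9214,28.4,26.1,11.2
1.1987,-78.9356,30.1,26.3,9.8
1.2101,-78.9102,15.7,25.8,14.5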

Browse to the file with your data. Make sure that csv is selected for File format. Additionally, make sure that X field points to the column with your longitude values and Y field to latitude. QGIS is smart enough to recognize longitude and latitude columns, but double check! You can also see an overview of the data, with columns for latitude, longitude, Barometer mmHg, conductivity, Salinity psu, and other variables. You can leave everything else as default and click OK.

You’ll be prompted to select a coordinate reference system, and this is very important: if you do not select the right one, your points will end up in the wrong location. For GPS coordinates, like the data we are using here, you need to select WGS 84, EPSG 4326.

Now we can see all the points where we collected data!

As we saw earlier, the data contain environmental measurements such as salinity, turbidity, and temperature. We can style the layer of sampling points based on these variables. In this example we will style the points by salinity.

Right-click on the layer with our data in the Layers Panel, in this case our layer 2017_ecuador_ysi_dat.., and select Properties.

There are many styles you can choose for the layer; the styling options are located in the Style tab of the Properties dialogue. Clicking on the drop-down button in the Style dialogue, you’ll see there are five options available: Single Symbol, Categorized, Graduated, Rule-based, and Point displacement. We’ll use Graduated, which allows you to break the data down into unique classes. Here we will use the salinity values and classify them into 3 classes: low, medium, and high salinity. There are 5 modes available in the Graduated style to do this: Equal Interval, Quantile, Natural Breaks, Standard Deviation, and Pretty Breaks. You can read more about these options in the QGIS documentation.

In this tutorial, for simplicity, we’ll use the Quantile option. This method chooses the class breaks so that each class contains the same number of values; for example, if there are 100 values and we want 4 classes, the quantile method will set the breaks so that each class holds 25 values (see the short example below).
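If you want to see what those quantile breaks look like numerically before styling the layer, you can compute them outside of QGIS. Here is a quick sketch in R, assuming your salinity values are in a vector called salinity:

## three classes -> breaks at the 0, 1/3, 2/3, and 1 quantiles
breaks <- quantile(salinity, probs = seq(0, 1, length.out = 4))
breaks

## each class then holds roughly a third of the points
table(cut(salinity, breaks, include.lowest = TRUE))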

In the Style section, select Graduated; in Column, select salinity psu; and for the color ramp we’ll use colors ranging from yellow to red.

In the Classes box enter 3 and set Mode to Quantile. Click on Classify, and QGIS will sort your values into the different ranges.

Now we have all the data points colored by the 3 different ranges: low, medium, and high salinity.

However, we have a lot of points and they are hard to distinguish. We can edit the points by right-clicking on the marker points and selecting edit symbol.

Now, I am going to get rid of the black outline to make the points easier to see. Select the point by clicking on Simple Marker and, in Outline style, select No Pen. Do the same for the remaining two point classes.

Nice, now we can better see variations in our points based on salinity!

IV. Print Composer: making a final map

We can start to assemble the final version of our map. QGIS has the option to create a Print Composer where you can edit your map. Go to Project -> New Print Composer.

You will be prompted to enter a title for the composer; enter a title name and hit OK. You will be taken to the Composer window.

In the Print Composer window, we want to bring in the map view that we see in the main QGIS canvas. Go to Layout -> Add Map. Once the Add map button is active, hold the left mouse button and drag a rectangle where you want to insert the map. The rectangle will be rendered with the map from the main QGIS canvas.

At the far right you can see the Items box; this shows the map you just added. If you want to make changes, select the map and edit it under Item properties. Sometimes it is useful to adjust the scale until you are happy with the map.

We can also add a second map showing the location of Cayapas Mataje in South America as a geographic reference. Go to the main QGIS canvas and zoom out until you can see where in South America the reserve is located.

Now go back to the Print Composer and add the map of the entire region. You’ll do the same as with the first map: go to Layout -> Add Map and drag a rectangle where you want to insert the map. The rectangle will be rendered with the map from the main QGIS canvas. In the Items box, you can see you now have Map 0 and Map 1. Select Map 1 and add a frame under Item properties: click on Frame to activate it and adjust the thickness to 0.40 mm.

We can add a north arrow to the map. The Print Composer comes with a collection of map-related images, including many north arrows. Click Layout -> Add Image.

Holding the left mouse button, draw a rectangle in the top-right corner of the map canvas.

On the right-hand panel, click on the Item Properties tab, expand Search directories, and select the north arrow image you like most. Once you’ve selected your image, you can always edit the arrow under SVG parameters.

Now we’ll add a scale bar. Click on Layout -> Add a Scale bar. Click on the layout where you want the scale bar to appear. Choose the style and units that fit your requirements. In the Segments panel, you can adjust the number of segments and their size. Make sure Map 0 is selected under Main properties.

I’ll add a legend to the map. Go to Layout -> Add a Legend. Holding the left mouse button, draw a rectangle over the area where you want the legend to appear. You can make changes such as adding a title under Item properties, changing fonts, and renaming your legend entries by clicking on them and writing the text you want.

It’s time to label our map. Click on Layout -> Add Label. Click on the map and draw a box where the label should be. In the Item Properties tab, expand the Label section and enter the text as shown below. You can also make additional changes to the font and size by editing the label under Appearance.

Once you have your final version, you can export it as an image, PDF, or SVG. For this tutorial, let’s export it as an image. Click Composer -> Export as Image.

Here is our final map!

Now you can try the tutorial with your own data. Making maps is always a bit challenging but put your imagination to work!

Here is a list of links that could help with QGIS:

-QGIS blog with various tutorials and new info on functions to use: here.

-If you want more information on how QGIS handles symbol and vector data styling: here  is a good tutorial.

-If you need data, a good place to start is Natural Earth: Free vector and raster basemap data used for almost any cartographic endeavor.

If you have specific questions please don’t hesitate to ask.

 

Posted in Computer tutorials | 20 Comments