Tutorial: Basic heatmaps and ordination with paprica output

The output from our paprica pipeline for microbial community structure analysis and metabolic inference has changed quite a lot over the last few months. In response to some recent requests here’s a tutorial that walks through an ordination and a few basic plots with the paprica output. The tutorial assumes that you’ve completed this tutorial which runs paprica on the samples associated with our recent seagrass paper.

For our analysis lets bring the pertinent files into R and do some pre-processing:

## read in the edge and unique abundance tables. Note that it's going to take a bit to load the unique_tally file because you have a lot of variables!

tally <- read.csv('2017.07.03_seagrass.bacteria.edge_tally.csv', header = T, row.names = 1)
unique <- read.csv('2017.07.03_seagrass.bacteria.unique_tally.csv', header = T, row.names = 1)

## read in edge_data and taxon_map and seq_edge_map

data <- read.csv('2017.07.03_seagrass.bacteria.edge_data.csv', header = T, row.names = 1)
taxa <- read.csv('2017.07.03_seagrass.bacteria.taxon_map.csv', header = T, row.names = 1, sep = ',', as.is = T)
map <- read.csv('2017.07.03_seagrass.bacteria.seq_edge_map.csv', header = T, row.names = 1)

## convert all na's to 0, then check for low abundance samples

tally[is.na(tally)] <- 0
unique[is.na(unique)] <- 0
rowSums(tally)

## remove any low abundance samples (i.e. bad library builds), and also
## low abundance reads.  This latter step is optional, but I find it useful
## unless you have a particular interest in the rare biosphere.  Note that
## even with subsampling your least abundant reads are noise, so at a minimum
## exclude everything that appears only once.

tally.select <- tally[rowSums(tally) > 5000,]
tally.select <- tally.select[,colSums(tally.select) > 1000]

unique.select <- unique[rowSums(unique) > 5000,]
unique.select <- unique.select[,colSums(unique.select) > 1000]

If your experiment is based on factors (i.e. you want to test for differences between categories of samples) you may want to use DESeq2, otherwise I suggest normalizing by sample abundance.

## normalize

tally.select <- tally.select/rowSums(tally.select)
unique.select <- unique.select/rowSums(unique.select)

Now we’re going to do something tricky. For both unique.select and tally.select, rows are observations and columns are variables (edges or unique reads). Those likely don’t mean much to you unless you’re intimately familiar with the reference tree. We can map the edge numbers to taxa using “taxa” dataframe, but first we need to remove the “X” added by R to make the numbers legal column names. For the unique read labels, we need to split on “_”, which divides the unique read identified from the edge number.

## get edge numbers associated with columns, and map to taxa names.
## If the entry in taxon is empty it means the read could not be classifed below
## the level of the domain Bacteria, and is labeled as "Bacteria"

tally.lab.Row <- taxa[colnames(tally.select), 'taxon']
tally.lab.Row[tally.lab.Row == ""] <- 'Bacteria'

unique.lab.Row <- map[colnames(unique.select), 'global_edge_num']
unique.lab.Row <- taxa[unique.lab.Row, 'taxon']
unique.lab.Row[unique.lab.Row == ""] <- 'Bacteria'
unique.lab.Row[is.na(unique.lab.Row)] <- 'Bacteria'

In the above block of code I labeled the new variables as [tally|unique].lab.Row, because we’ll first use them to label the rows of a heatmap. Heatmaps are a great way to start getting familiar with your data.

## make a heatmap of edge abundance

heat.col <- colorRampPalette(c('white', 'lightgoldenrod1', 'darkgreen'))(100)

heatmap(t(data.matrix(tally.select)),
        scale = NULL,
        col = heat.col,
        labRow = tally.lab.Row,
        margins = c(10, 10))
heatmap(t(data.matrix(unique.select)),
        scale = NULL,
        col = heat.col,
        labRow = unique.lab.Row,
        margins = c(10, 10))

Heatmaps are great for visualizing broad trends in the data, but they aren’t a good entry point for quantitative analysis. A good next step is to carry out some kind of ordination (NMDS, PCoA, PCA, CA). Not all ordination methods will work well for all types of data. Here we’ll use correspondence analysis (CA) on the relative abundance of the unique reads. CA will be carried out with the package “ca”, while “factoextra” will be used to parse the CA output and calculate key additional information. You can find a nice in-depth tutorial on correspondence analysis in R here.

library(ca)
library(factoextra)

unique.select.ca <- ca(unique.select)
unique.select.ca.var <- get_eigenvalue(unique.select.ca)
unique.select.ca.res <- get_ca_col(unique.select.ca)

species.x <- unique.select.ca$colcoord[,1]
species.y <- unique.select.ca$colcoord[,2]

samples.x <- unique.select.ca$rowcoord[,1]
samples.y <- unique.select.ca$rowcoord[,2]

dim.1.var <- round(unique.select.ca.var$variance.percent[1], 1)
dim.2.var <- round(unique.select.ca.var$variance.percent[2], 2)

plot(species.x, species.y,
     ylab = paste0('Dim 2: ', dim.2.var, '%'),
     xlab = paste0('Dim 1: ', dim.1.var, '%'),
     pch = 3,
     col = 'red')

points(samples.x, samples.y,
       pch = 19)

legend('topleft',
       legend = c('Samples', 'Unique reads'),
       pch = c(19, 3),
       col = c('black', 'red'))

At this point you’re ready to crack open the unique.select.ca object and start doing some hypothesis testing. There’s one more visualization, however, that can help with initial interpretation; a heatmap of the top unique edges contributing to the first two dimensions (which account for nearly all of the variance between samples).

species.contr <- unique.select.ca.res$contrib[,1:2]
species.contr.ordered <- species.contr[order(rowSums(species.contr), decreasing = T),]
species.contr.top <- species.contr.ordered[1:10,]

species.contr.lab <- unique.lab.Row[order(rowSums(abs(species.contr)), decreasing = T)]

heatmap(species.contr.top,
        scale = 'none',
        col = heat.col,
        Colv = NA,
        margins = c(10, 20),
        labRow = species.contr.lab[1:10],
        labCol = c('Dim 1', 'Dim 2'),
        cexCol = 1.5)

From this plot we see that quite a few different taxa are contributing approximately equally to Dim 1 (which accounts for much of the variance between samples), including several different Pelagibacter and Rhodobacteracea strains. That makes sense as the dominant environmental gradient in the study was inside vs. outside of San Diego Bay and we would expect these strains to be organized along such a gradient. Dim 2 is different with unique reads associated with Tropheryma whipplei and Rhodoluna lacicola contributing most. These aren’t typical marine strains, and if we look back at the original data we see that these taxa are very abundant in just two samples. These samples are the obvious outliers along Dim 2 in the CA plot.

In this tutorial we covered just the community structure output from paprica, but of course the real benefit to using paprica is its estimation of metabolic potential. These data are found in the *.ec_tally.csv and *path_tally.csv files, and organized in the same way as the edge and unique read abundance tables. Because of this they can be plotted and analyzed in the same way.


9595 Total Views 2 Views Today
This entry was posted in paprica. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield