The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads. To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank. In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically. The following is a brief analysis of some genomic features using the paprica database and R. If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.
The paprica pipeline and database can be obtained from Github here. In this post I’ll be using the database associated with version 0.3.1. The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:
## Read in the pathways associated with the terminal nodes on the reference tree path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1) path[is.na(path)] <- 0 ## Read in the data associated with all completed genomes in Genbank data <- read.csv('paprica/ref_genome_database/bacteria/genome_data.final.csv', row.names = 1) ## During database creation genomes with duplicate 16S rRNA genes were removed, ## so limit to those that were retained data <- data[row.names(data) %in% row.names(path),] ## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree, ## however, "data" is not, so order it here data <- data[order(data$clade),]
One fun thing to do at this point is to look at the distribution of metabolic pathways across the database. To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.
## The pathway file in the database is binary, so we use Jaccard for distance library('vegan') path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!) path.clust <- hclust(path.dist)
The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.
## Set a binary color scheme image.col <- colorRampPalette(c('white', 'blue'))(2) ## Image will order matrix in ascending order, which is not what we want here! image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1], col = image.col, ylab = 'Genome', xlab = 'Pathway', xaxt = 'n', yaxt = 'n') box()
There are a couple of interesting things to note in this plot. First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage. For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough. Second, there are some genomes that have very few pathways. There are a couple of possible reasons for this that can be explored with a little more effort. One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction. Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts). Obligate symbionts often have highly streamlined genomes and few complete pathways. We can compare the number of pathways in each genome to other genome characteristics for additional clues.
A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome. Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.
## Plot the number of pathways as a function of genome size plot(rowSums(path) ~ data$genome_size, ylab = 'nPaths', xlab = 'Genome size') ## Plot P. ubique as a reference point select <- grep('Pelagibacter ubique HTCC1062', data$organism_name) points(rowSums(path)[select] ~ data$genome_size[select], pch = 19, col = 'red')
That looks pretty good. For the most part more metabolic pathways were predicted for larger genomes, however, there are some exceptions. The red point gives the location of Pelagibacter ubique HTCC1062. This marine bacterium is optimized for life under energy-limited conditions. Among its adaptations are a highly efficient and streamlined genome. In fact it has the smallest genome of any known free-living bacterium. All the points below it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs. There are a few larger genomes that have very few (or even no) pathways predicted. These are the genomes with bad, or at least incompatible annotations (or particularly peculiar biochemistry).
The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi; a measure of genomic plasticity. We can make another plot to show the distribution of these parameters with respect to phylogeny.
## Grab only the data columns we want data.select <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')] ## Make the units somewhat comparable on the same scale, a more ## careful approach would log-normalize some of the units first data.select.norm <- decostand(data.select, method = 'standardize') data.select.norm <- decostand(data.select.norm, method = 'range') ## Plot with a heatmap heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100) heatmap(data.matrix(data.select.norm), margins = c(10, 20), col = heat.col, Rowv = NA, Colv = NA, scale = NULL, labRow = 'n', cexCol = 0.8)
Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting. These two parameters can be plotted to get a better view:
plot(data.select$phi ~ data.select$GC, xlab = 'GC', ylab = 'phi')
Okay, not so much… but I think the pattern here is pretty interesting. Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes. This can be evaluated with linear models for genomes above and below 50 % GC.
gc.phi.above50 <- lm(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)]) gc.phi.below50 <- lm(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)]) summary(gc.phi.above50) summary(gc.phi.below50) plot(data.select$phi ~ data.select$GC, xlab = 'GC', ylab = 'phi', type = 'n') points(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)], col = 'blue') points(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)], col = 'red') abline(gc.phi.above50, col = 'blue') abline(gc.phi.below50, col = 'red') legend('bottomleft', bg = 'white', legend = c('GC >= 50', 'GC < 50'), col = c('blue', 'red'), pch = 1)
As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R2 = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R2 = 0.106, p = 0). So what’s going on here? Low GC content is associated with particular microbial lineages but also with certain ecologies. The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is). Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.