I’m very excited to report that our latest paper, Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula, was just published in the journal PLoS ONE. The paper builds on two very distinct bodies of work: a growing literature on microbial community structure and function along the climatically sensitive West Antarctic Peninsula, and a family of new techniques to predict community metabolic function from 16S rRNA gene libraries, which we are calling metabolic inference.
The motivation for metabolic inference is the large amount of time that it takes to manually curate a likely set of functions for even a small collection of 16S rRNA genes. In today’s world, where most analyses of microbial community structure consist of many thousands of reads representing hundreds of taxa, it is simply impossible to dig through the literature on each strain to see what metabolic role each is likely to be playing. Ideally a researcher would use metagenomics or metatranscriptomics to get at this information directly, but it is not advisable or desirable in most cases to sequence the hundreds of metagenomes or metatranscriptomes needed for the kind of temporal or spatial resolution many of us want these days. Metabolic inference provides a convenient alternative.

A quick Google Scholar survey of the number of studies since 2005 that have used high throughput 16S rRNA gene sequencing. Over the last ten years we’ve collected an astonishing amount of sequence data from a diverse array of environments. However, much of this data has been from taxonomic marker genes like the 16S rRNA gene, leaving microbial community function largely unknown. PAPRICA and other methods that try to infer microbial functional potential from 16S rRNA gene data can help bridge this gap.
The basic concept behind all metabolic inference techniques (e.g. PICRUSt, tax4fun, PAPRICA) is hidden state prediction (HSP) (you can find a nice paper on HSP here). In 16S rRNA gene analysis metabolic potential is a hidden state. The metabolic inference techniques propose different ways to predict this hidden state based on the information available.
Our small contribution to this effort was to develop a method (PAPRICA – PAthway PRediction by phylogenetIC plAcement) that uses phylogenetic placement to conduct the metabolic inference instead of an OTU (operational taxonomic unit) based approach. Our approach provides a more intuitive connection between the 16S rRNA analysis and the HSP (or at least it does in my mind) and can increase the accuracy of the inference for taxa that have a lot of sequenced genomes.
Most analyses of large 16S rRNA datasets rely on an OTU based approach. In a typical OTU analysis an investigator aligns 16S rRNA reads, constructs a distance matrix from the alignment, and clusters the reads at some predetermined distance. By tradition the default distance has become a dissimilarity of 0.03. This approach has some advantages. By clustering reads into discrete units it is easy to quantify the presence or absence of different OTUs, and it allows microbial ecologists to avoid problems with defining prokaryotic species (which defy most of the criteria used to define species in more complex organisms). To conduct a metabolic inference on an OTU based analysis it is possible to simply reconstruct the likely metabolism for a predefined set of OTUs based on the OTU assignments of published genomes. This works great, but it limits the resolution of the inference to the selected OTU definition (i.e. 0.03). For some taxa, such as Escherichia coli (and plenty of more interesting environmental bugs), there are many sequenced genomes that have very similar 16S rRNA gene sequences. PAPRICA provides a way to improve the resolution of the metabolic inference for these taxa.
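To make the OTU idea concrete, here is a minimal sketch in Python of clustering pre-aligned reads at a dissimilarity of 0.03. It is only an illustration: the function names, the greedy seed-based strategy, and the toy reads are my own assumptions, and real OTU pickers use more careful distance calculations and clustering algorithms.

```python
def dissimilarity(a, b):
    """Fraction of mismatched positions between two aligned, equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def cluster_otus(reads, cutoff=0.03):
    """Greedy clustering: a read joins the first OTU whose seed is within cutoff,
    otherwise it becomes the seed of a new OTU."""
    otus = []  # list of (seed_read, [member reads])
    for read in reads:
        for seed, members in otus:
            if dissimilarity(seed, read) <= cutoff:
                members.append(read)
                break
        else:
            otus.append((read, [read]))
    return [members for _, members in otus]


# Toy aligned reads, 100 positions each (made-up sequences, not real 16S data).
reads = [
    "A" * 100,             # seed of the first OTU
    "A" * 98 + "CC",       # 2 mismatches in 100 positions
    "A" * 90 + "C" * 10,   # 10 mismatches in 100 positions
]
otus = cluster_otus(reads)
print(len(otus), [len(members) for members in otus])
```

Any two reads that land in the same OTU are treated as identical from that point on, which is exactly the resolution limit described above.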
Our approach was to build a phylogenetic tree of the 16S rRNA genes from each completed genome. For each internal node on the reference tree we determine a “consensus genome”, defined as all genes shared by all members of the clade originating from the node, and predict the metabolic pathways present in the consensus and complete genomes using Pathway-Tools. To conduct the actual analysis we use pplacer to place our query reads on the reference tree and assign the metabolic pathways for each point of placement to the query reads. One advantage to this approach is that the resolution changes depending on the genome sequence coverage of the reference tree. For families, genera, and even species for which lots of genomes have been sequenced, resolution is high. For regions of the tree where there are not many sequenced genomes resolution is poor, but the method will still give you the best of what’s available.
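The consensus genome idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the PAPRICA code: I read the consensus genome as the set of genes shared by every genome in a clade, and the tree layout, node names, gene names, and placements are all hypothetical. In the real workflow pplacer does the placement and Pathway-Tools predicts the pathways.

```python
def consensus_genomes(tree, genomes, node="root", out=None):
    """Return {node: set of genes shared by every genome in the clade below node}."""
    if out is None:
        out = {}
    children = tree.get(node, [])
    if not children:
        # Terminal node: a complete genome, so it keeps its full gene set.
        out[node] = set(genomes[node])
    else:
        child_sets = []
        for child in children:
            consensus_genomes(tree, genomes, child, out)
            child_sets.append(out[child])
        # Internal node: only genes present in every descendant survive.
        out[node] = set.intersection(*child_sets)
    return out


# Hypothetical reference tree: internal node -> list of child nodes.
tree = {"root": ["n1", "g3"], "n1": ["g1", "g2"]}

# Hypothetical gene content of the three completed genomes.
genomes = {
    "g1": {"geneA", "geneB", "geneC"},
    "g2": {"geneA", "geneB"},
    "g3": {"geneA", "geneD"},
}

consensus = consensus_genomes(tree, genomes)

# Hypothetical placements of query reads on the reference tree.
placements = {"read_1": "g1", "read_2": "n1", "read_3": "root"}
for read, node in placements.items():
    print(read, node, sorted(consensus[node]))
```

A read placed on a terminal node inherits a complete genome, while a read placed deep in the tree inherits only what the whole clade shares, which is how resolution tracks the genome coverage of the reference tree.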

Figure from Bowman and Ducklow, 2015. PAPRICA includes a confidence scoring metric that takes into account the relative plasticity of different genomes. In this figure each vertical line is a genome (representing a numbered terminal node on our reference tree), with the height and color of the vertical line giving its relative plasticity (which we refer to as the parameter phi). The genomes identified with Roman numerals are all known to be exceptionally modified, which is a nice validation of the phi parameter. Many of these are obligate symbionts. I) Nanoarchaeum equitans, II) the Mycobacteria, III) a butyrate producing bacterium within the Clostridium, IV) Candidatus Hodgkinia cicadicola, V) the Mycoplasma, VI) Sulcia muelleri, VII) Portiera aleyrodidarum, VIII) Buchnera aphidicola, IX) the Oxalobacteraceae.
PAPRICA provides some additional helpful pieces of information. We built in a confidence scoring metric that takes into account both predicted genomic plasticity and the size of the consensus genome relative to the mean size for the clade (deeper branching clades will have a bigger difference). PAPRICA also predicts the genome size and the number of 16S rRNA gene copies associated with each 16S rRNA gene, both of which have a strong connection to the ecological role of a bacterium.
For our initial application of PAPRICA we selected a previously published 16S rRNA gene sequence dataset from the West Antarctic Peninsula (our primary region of interest). One thing that we were very interested in looking at was whether we could describe differences between microbial communities organized along ecological gradients (e.g. inshore vs. offshore, or surface vs. deep water) in terms of metabolic structure in place of the more traditional 16S rRNA gene (i.e. taxonomic) structure. Using PAPRICA to convert the 16S rRNA gene sequences into collections of metabolic pathways we found that we could reconstruct the same inter-sample relationships identified by an analysis of taxonomic structure. This means that a microbial ecologist can, if they choose, disregard the messy and sometimes uninformative taxonomic structure data and go directly to metabolic structure without losing information. Applying common multivariate statistical approaches (PCA, MDS, etc.) to metabolic structure data yields information like which pathways are driving the variance between sites, and which are correlated with what environmental parameters. This information is much more relevant to most research questions than the distribution of different microbial taxa. It is worth noting that while inter-sample relationships are well preserved in metabolic structure, the absolute distance between samples is much less than for taxonomic structure. This might have some implications for the functional resilience of microbial communities, which we get into a little bit in the paper.
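Here is a small Python sketch of how the distance between two samples can shrink when you move from taxonomic to metabolic structure. Everything in it is made up for illustration (the taxa, pathway assignments, counts, and the choice of Bray-Curtis dissimilarity are my assumptions, not values or methods from the paper), so treat it as intuition rather than a result.

```python
from collections import Counter


def to_pathway_abundance(taxon_counts, taxon_pathways):
    """Convert taxon abundances to pathway abundances by summing over taxa."""
    pathway_counts = Counter()
    for taxon, count in taxon_counts.items():
        for pathway in taxon_pathways[taxon]:
            pathway_counts[pathway] += count
    return pathway_counts


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (dict-like)."""
    keys = set(a) | set(b)
    diff = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    total = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return diff / total


# Hypothetical pathway content for three toy taxa.
taxon_pathways = {
    "taxon_A": {"glycolysis", "TCA"},
    "taxon_B": {"glycolysis", "TCA", "nitrate_reduction"},
    "taxon_C": {"glycolysis", "photosynthesis"},
}

# Hypothetical read counts for a surface and a deep sample.
surface = {"taxon_A": 10, "taxon_C": 30}
deep = {"taxon_A": 5, "taxon_B": 35}

taxonomic = bray_curtis(surface, deep)
metabolic = bray_curtis(
    to_pathway_abundance(surface, taxon_pathways),
    to_pathway_abundance(deep, taxon_pathways),
)
print(round(taxonomic, 3), round(metabolic, 3))
```

In this toy case the two samples share almost no taxa but do share core pathways, so the metabolic distance comes out smaller than the taxonomic one. Profiles like these are also what you would feed into PCA or MDS to ask which pathways drive the differences between sites.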
PAPRICA was an outgrowth of a couple of other papers that I’m working on. At some point the bioinformatic methods reached a point where separate publication was justified. As a result, and reflecting the fact that I’m much more an ecologist than a computational biologist, PAPRICA is not nearly as streamlined as PICRUSt (which is even available through an online interface). I’ve spent quite a bit of time, however, trying to make the scripts user friendly and transportable. Anyone should be able to get them to work without too much difficulty. If you decide to give PAPRICA a try and run into any hitches please let me know, either by posting an issue on GitHub or emailing me directly! Suggestions for improvement are also welcome.