Last week I gave a talk at the biennial Ocean Sciences Meeting that included some results from analysis with paprica. Since paprica is a relatively new method I showed the figure below to demonstrate that it works. The figure shows a strong correlation, for four metagenomes, between observed enzyme abundance and enzyme abundance predicted with paprica (from 16S rRNA gene reads extracted from each metagenome). This is similar to the approach used to validate PICRUSt and Tax4Fun.
The correlation looks decent, right? It’s not perfect, but most enzymes are predicted at close to their observed abundance (except for the green points, where enzyme abundance is over-predicted because metagenome coverage is lower).
After the talk I was approached by a well-known microbial ecologist who suggested that I compare these correlations to correlations with a random collection of enzymes. His concern was that, because many enzymes (or genes, or metabolic pathways) are widely shared across genomes, any random collection of genomes looks sort of like a metagenome. I gave this a shot, and here are the results for one of the metagenomes used in the figure above.
Uh oh. The correlation is better for predicted than for random enzyme abundance, but rho = 0.7 is a really good correlation for the random dataset! If you think about it, however, this makes sense. For this test I generated the random dataset by randomly selecting genomes from the paprica database until the total number of enzymes equaled the number predicted for the metagenome. Because there are only 2,468 genomes in the current paprica database (fewer than the total number of completed genomes, because only one genome is used for each unique 16S rRNA gene sequence), the database gets pretty well sampled during random selection. As a result rare enzymes (which are also usually rare in the metagenome) are rare in the random sample, and common enzymes (typically common in the metagenome) are common. So random ends up looking a lot like observed.
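For anyone who wants to try this test on their own data, the random-selection procedure is simple enough to sketch. This is only an illustration of the idea, not paprica’s actual internals; `genome_enzymes` is a hypothetical structure mapping each genome to its enzyme (EC number) counts.

```python
import random
from collections import Counter

def random_enzyme_sample(genome_enzymes, target_total, seed=None):
    """Draw genomes uniformly at random, with replacement, accumulating
    their enzymes until the total count reaches target_total (the number
    of enzymes predicted for the metagenome)."""
    rng = random.Random(seed)
    genomes = list(genome_enzymes)
    sampled = Counter()
    while sum(sampled.values()) < target_total:
        # add all enzyme counts from one randomly chosen genome
        sampled.update(genome_enzymes[rng.choice(genomes)])
    return sampled
```

Because whole genomes are added at a time, the sample can slightly overshoot the target, which is harmless for a rank correlation.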
It was further suggested that I try removing core enzymes for this kind of test. Here are the results for different definitions of “core”, ranging from enzymes that appear in less than 100% of genomes (i.e. all enzymes, since no EC number appeared in every genome) to those that appear in less than 1% of genomes.
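The “core” filter amounts to keeping only enzymes below a prevalence threshold. A minimal sketch, again using a hypothetical `genome_enzymes` mapping rather than paprica’s real database format:

```python
from collections import Counter

def non_core_enzymes(genome_enzymes, max_fraction):
    """Return the EC numbers present in fewer than max_fraction of genomes.
    genome_enzymes: dict mapping genome name -> iterable of EC numbers."""
    n = len(genome_enzymes)
    prevalence = Counter()
    for ecs in genome_enzymes.values():
        # count each EC once per genome, regardless of copy number
        prevalence.update(set(ecs))
    return {ec for ec, count in prevalence.items() if count / n < max_fraction}
```

With `max_fraction=1.0` everything passes (no enzyme is in every genome), while `max_fraction=0.01` keeps only the rare, taxonomically constrained enzymes.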
The difference between the random and predicted correlations does change as the definition of the core group of enzymes changes. Here’s the data aggregated for all four metagenomes in the form of a sad little Excel plot (error bars give standard deviation).
This suggests a couple of things to me. First, although I was initially surprised by the high correlation between a random and an observed set of enzymes, I’m heartened that paprica consistently does better. There’s plenty of room for improvement (and each new build of the database does improve as additional genomes are completed; the last build added 78 new genomes, see the current development version), but the method does work. Second, we obtain maximum “sensitivity”, defined as improvement over the random correlation, for enzymes that are present in fewer than 10% of the genomes in the database. Above that threshold the correlation is inflated (but not invalidated) by common enzymes; below it we start to lose predictive power. This can be seen in the sharp drop in the predicted-minus-random rho (Δrho: is it bad form to mix Greek letters with the English version of same?) for enzymes present in less than 1% of genomes. Because lots of interesting enzymes are not very common, this is where we have to focus our future efforts. As I mentioned earlier, some improvement in this area is automatic; each newly completed genome improves our resolution.
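The Δrho statistic itself is just the difference between two Spearman correlations against the observed abundances. A self-contained sketch (a minimal rank correlation without tie correction, for illustration only; the real analysis would use a standard implementation such as `scipy.stats.spearmanr`):

```python
def spearman_rho(x, y):
    """Spearman rank correlation (no tie correction; illustrative only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def delta_rho(observed, predicted, random_sample):
    """Improvement of the predicted over the random correlation, where each
    argument maps EC number -> abundance."""
    ecs = sorted(set(observed) | set(predicted) | set(random_sample))
    obs = [observed.get(ec, 0) for ec in ecs]
    rho_pred = spearman_rho(obs, [predicted.get(ec, 0) for ec in ecs])
    rho_rand = spearman_rho(obs, [random_sample.get(ec, 0) for ec in ecs])
    return rho_pred - rho_rand
```

A positive Δrho means the inference is doing something a random draw from the database cannot.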
Some additional thoughts on this. There are parameters in paprica that might improve Δrho. The contents of closest estimated genomes are determined by a cutoff value: the fraction of descendant genomes a pathway or enzyme must appear in. I redid the Δrho calculations for cutoff values ranging from 0.9 to 0.1. Surprisingly, this had only a minor impact on Δrho. The reason is that most of the 16S reads extracted from the metagenomes placed to closest completed genomes (for which the cutoff is meaningless) rather than to closest estimated genomes. An additional consideration is that I did all of these calculations for enzyme predictions/observations rather than metabolic pathways, because predicting metabolic pathways for metagenomes is rather complicated (but doable). Pathways have the advantage of being more conserved than enzymes, however, so I expect to see an improved Δrho when I get around to redoing these calculations with pathways.
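To make the cutoff concrete: the content of a closest estimated genome is essentially a consensus over its descendant genomes. A sketch of that idea, not paprica’s actual code:

```python
from collections import Counter

def estimated_genome(descendant_enzymes, cutoff):
    """Consensus enzyme content for an estimated (internal-node) genome:
    keep an EC number if it appears in at least `cutoff` fraction of the
    descendant genomes. descendant_enzymes: list of EC-number iterables."""
    n = len(descendant_enzymes)
    prevalence = Counter()
    for ecs in descendant_enzymes:
        prevalence.update(set(ecs))
    return {ec for ec, count in prevalence.items() if count / n >= cutoff}
```

A high cutoff (0.9) yields a conservative estimated genome; a low cutoff (0.1) a permissive one, which is why varying it mattered little when most reads placed to completed genomes anyway.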
Something else that’s bugging me a bit… metagenomes aren’t sets of randomly distributed genomes. Bacterial community structure typically follows a log-type distribution, with a few dominant taxa and a long tail of rare taxa. Metabolic inference methods, by their nature, capture this distribution. A more interesting test might be to create a log-distributed random population of genomes, but this adds all kinds of additional complexity, chief among it the need to create many random datasets with different (randomly selected) dominant taxa. That seems entirely too cumbersome for this purpose…
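The weighted draw itself is easy enough to sketch; the cumbersome part is repeating it many times with different dominant taxa. A minimal version, assuming the same hypothetical `genome_enzymes` mapping as above, uses log-normally distributed sampling weights so a few genomes dominate:

```python
import random
from collections import Counter

def lognormal_random_community(genome_enzymes, target_total, sigma=2.0, seed=None):
    """Draw genomes with log-normally distributed weights, so a few taxa
    dominate and most are rare, accumulating enzymes up to target_total.
    Larger sigma gives a steeper rank-abundance curve. A sketch only."""
    rng = random.Random(seed)
    genomes = list(genome_enzymes)
    # one random weight per genome; the identity of the dominant taxa
    # changes with every new set of weights
    weights = [rng.lognormvariate(0.0, sigma) for _ in genomes]
    sampled = Counter()
    while sum(sampled.values()) < target_total:
        genome = rng.choices(genomes, weights=weights, k=1)[0]
        sampled.update(genome_enzymes[genome])
    return sampled
```

Each call with a fresh seed gives a different set of dominant taxa, which is exactly why a proper test would need many replicates.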
So to summarize…
- Metabolic inference definitively outperforms random selection. This is good, but I’d like the difference (Δrho) to be larger than it is.
- It is not adequate to validate a metabolic inference technique using correlation with a metagenome alone. The improvement over a randomly generated dataset should be used instead.
- paprica, and probably other metabolic inference techniques, have poor predictive power for rare (i.e. very taxonomically constrained) enzymes/pathways. This shouldn’t surprise anyone.
- Alternate validation techniques might be more useful than correlating with the abundance of enzymes/pathways in metagenomes. Alternatives include correlating the distance in metabolic structure between samples with distance in community structure, as we did in this paper, or correlating predictions for draft genomes. In that case it would be necessary to generate a distribution of correlation values for the draft genome against the paprica (or other method’s) database, and see where the correlation for the inferred metabolism falls in that distribution. Because the contents of a draft genome are a little more constrained than the contents of a metagenome I think I’m going to spend some time working on this approach…