I’m very excited to report that our latest paper – Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula was just published in the journal PLoS one. The paper builds on two very distinct bodies of work; a growing literature on microbial community structure and function along the climatically sensitive West Antarctic Peninsula, and a family of new techniques to predict community metabolic function from 16S rRNA gene libraries, which we are calling metabolic inference.
The motivation for metabolic inference is in the large amount of time that it takes to manually curate a likely set of functions for even a small collection of 16S rRNA genes. In today’s world, where most analyses of microbial community structure consist of many thousand of reads representing hundreds of taxa, it is simply impossible to dig through the literature on each strain to see what metabolic role each is likely to be playing. Ideally a researcher would use metagenomics or metatranscriptomics to get at this information directly, but it is not advisable or desirable in most cases to sequence hundreds of metagenomes or metatranscriptomes (necessary for the kind of temporal or spatial resolution many of us want these days). Metabolic inference provides a convenient alternative.
The basic concept behind all metabolic inference techniques (e.g. PICRUSt, tax4fun, PAPRICA) is hidden state prediction (HSP) (you can find a nice paper on HSP here). In 16S rRNA gene analysis metabolic potential is a hidden state. The metabolic inference techniques propose different ways to predict this hidden state based on the information available.
Our small contribution to this effort was to develop a method (PAPRICA – PAthway PRediction by phylogenetIC plAcement) that uses phylogenetic placement to conduct the metabolic inference instead of an OTU (operational taxonomic unit) based approach. Our approach provides a more intuitive connection between the 16S rRNA analysis and the HSP (or at least it does in my mind) and can increase the accuracy of the inference for taxa that have a lot of sequenced genomes.
Most analysis of large 16S rRNA datasets rely on an OTU based approach. In a typical OTU analysis an investigator aligns 16S rRNA reads, constructs a distance matrix of the alignments, and clusters the reads at some predetermined distance. By tradition the default distance has become a dissimilarity of 0.03. This approach has some advantages. By clustering reads into discrete units it is easy to quantify the presence or absence of different OTUs, and it allows microbial ecologists to avoid problems with defining prokaryotic species (which defy most of the criteria used to define species in more complex organisms). To conduct a metabolic inference on an OTU based analyses it is possible to simply reconstruct the likely metabolism for a predefined set of OTUs based on the OTU assignments of published genomes. This works great, but it limits the resolution of the inference to the selected OTU definition (i.e. 0.03). For some taxa, such as Escherichia coli (and plenty of more interesting environmental bugs), there are many sequenced genomes that have very similar 16S rRNA gene sequences. PAPRICA provides a way to improve the resolution of the metabolic inference for these taxa.
Our approach was to build a phylogenetic tree of the 16S rRNA genes from each completed genome. For each internal node on the reference tree we determine a “consensus genome”, defined as all genomes shared by all members of the clade originating from the node, and predict the metabolic pathways present in the consensus and complete genomes using Pathway-Tools. To conduct the actual analysis we use pplacer to place our query reads on the reference tree and assign the metabolic pathways for each point of placement to the query reads. One advantage to this approach is that the resolution changes depending on genomes sequence coverage of the reference tree. For families, genera, and even species for which lots of genomes have been sequenced resolution is high. For regions of the tree where there are not many sequenced genomes resolution is poor, however, the method will give you the best of what’s available.
PAPRICA provides some additional helpful pieces of information. We built in a confidence scoring metric that takes into account both predicted genomic plasticity and the size of the consensus genome relative to the mean size for the clade (deeper branching clades will have a bigger difference), and predicts the size of the genome and number of 16S rRNA gene copies associated with each 16S rRNA gene, both of which have a strong connection to the ecological role of a bacterium
For our initial application of PAPRICA we selected a previously published 16S rRNA gene sequence dataset from the West Antarctic Peninsula (our primary region of interest). One thing that we were very interested in looking at was whether we could describe differences between microbial communities organized along ecological gradients (e.g. inshore vs. offshore, or surface vs. deep water) in terms of metabolic structure in place of the more traditional 16S rRNA gene (i.e. taxonomic) structure. Using PAPRICA to convert the 16S rRNA gene sequences into collections of metabolic pathways we found that we could reconstruct the same inter-sample relationships identified by an analysis of taxonomic structure. This means that a microbial ecologist can, if they choose, disregard the messy and sometimes uninformative taxonomic structure data and go directly to metabolic structure without losing information. Applying common multivariate statistical approaches (PCA, MDS, etc.) to metabolic structure data yields information like which pathways are driving the variance between sites, and which are correlated with what environmental parameters. This information is much more relevant to most research questions than the distribution of different microbial taxa. It is worth noting that while inter-sample relationships are well preserved in metabolic structure, the absolute distance between samples is much less than for taxonomic structure. This might have some implications for the functional resilience of microbial communities, which we get into a little bit in the paper.
PAPRICA was an outgrowth of a couple of other papers that I’m working on. At some point the bioinformatic methods reached a point where separate publication was justified. As a result, and reflecting the fact that I’m much more an ecologist than a computational biologist, PAPRICA is not nearly as streamlined as PICRUSt (which is even available through an online interface). I’ve spent quite a bit of time, however, trying to make the scripts user friendly and transportable. Anyone should be able to get them to work without too much difficulty. If you decide to give PAPRICA a try and run into an hitches please let me know, either by posting an issue in Github or emailing me directly! Suggestions for improvement are also welcome.
Hi
I am having problems with paprica’s execution, I am reciving the following erro messenger in pandas as follow
Traceback (most recent call last):
File “/usr/local/bin/paprica-tally_pathways.py”, line 117, in
genome_data = pd.DataFrame.from_csv(ref_dir_domain + ‘genome_data.final.csv’, header = 0, index_col = 0)
File “/usr/lib64/python2.7/site-packages/pandas/core/frame.py”, line 1177, in from_csv
infer_datetime_format=infer_datetime_format)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 590, in __init__
self._make_engine(self.engine)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File “pandas/parser.pyx”, line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File “pandas/parser.pyx”, line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /usr/local/bin/ref_genome_database/bacteria/genome_data.final.csv does not exist
Daniel, I think I found the bug – change line 52 in paprica-tally_pathways.py to:
paprica_path = os.path.dirname(os.path.abspath("__file__")) + '/'
Alternatively reclone from the repository as that version should be up to date. Let me know if still having problems.
Hi Jeff,
The error persist. I am receiving the following.
Could you help me with that, please?
./paprica-run.sh test.bacteria bacteria
# cmalign :: align sequences to a CM
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
# CM file: /usr/local/install/temp/paprica-paprica_v0.4.0a/models/bacteria_ssu.cm
# sequence file: /usr/local/install/temp/paprica-paprica_v0.4.0a/test.bacteria.clean.fasta
# CM name: SSU_rRNA_bacteria
# saving alignment to file: /usr/local/install/temp/paprica-paprica_v0.4.0a/test.bacteria.clean.align.sto
# output alignment format specified as: Pfam
# output alignment alphabet: DNA
# number of worker threads: 64
# – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
# running time (s)
# ——————————-
# idx seq name length cm from cm to trunc bit sc avg pp band calc alignment total mem (Mb)
# — ————– —— ——- ——- —– ——– —— ——— ——— ——— ——–
1 SRR953432.50 84 968 1054 3′ 39.93 0.974 0.08 0.30 0.38 2.91
2 SRR953432.90 84 968 1054 3′ 39.93 0.974 0.06 0.21 0.27 2.91
3 SRR953432.97 84 968 1054 3′ 39.93 0.974 0.09 0.29 0.37 2.91
4 SRR953432.307 84 968 1054 3′ 39.93 0.974 0.08 0.30 0.38 2.91
5 SRR953432.386 84 968 1054 3′ 39.93 0.974 0.08 0.29 0.37 2.91
6 SRR953432.641 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.38 2.91
7 SRR953432.836 84 968 1054 3′ 39.93 0.974 0.08 0.29 0.38 2.91
8 SRR953432.860 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.37 2.91
9 SRR953432.903 84 968 1054 3′ 39.93 0.974 0.08 0.30 0.38 2.91
10 SRR953432.1226 84 968 1054 3′ 39.93 0.974 0.07 0.29 0.36 2.91
11 SRR953432.1336 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.37 2.91
12 SRR953432.1341 84 968 1054 3′ 39.93 0.974 0.09 0.27 0.36 2.91
13 SRR953432.1426 84 968 1054 3′ 39.93 0.974 0.09 0.27 0.36 2.91
14 SRR953432.1527 84 968 1054 3′ 39.93 0.974 0.07 0.28 0.35 2.91
15 SRR953432.1647 84 968 1054 3′ 39.93 0.974 0.09 0.27 0.36 2.91
16 SRR953432.2076 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.37 2.91
17 SRR953432.2160 84 968 1054 3′ 39.93 0.974 0.07 0.31 0.38 2.91
18 SRR953432.2164 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.36 2.91
19 SRR953432.2192 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.37 2.91
20 SRR953432.2374 84 968 1054 3′ 39.93 0.974 0.07 0.29 0.36 2.91
21 SRR953432.2713 84 968 1054 3′ 39.93 0.974 0.09 0.28 0.37 2.91
22 SRR953432.2934 84 968 1054 3′ 39.93 0.974 0.09 0.28 0.37 2.91
23 SRR953432.3261 84 968 1054 3′ 39.93 0.974 0.07 0.31 0.37 2.91
24 SRR953432.3998 84 968 1054 3′ 39.93 0.974 0.09 0.29 0.37 2.91
25 SRR953432.4090 84 968 1054 3′ 39.93 0.974 0.07 0.30 0.37 2.91
26 SRR953432.4248 84 968 1054 3′ 39.93 0.974 0.07 0.29 0.36 2.91
27 SRR953432.4378 84 968 1054 3′ 39.93 0.974 0.07 0.28 0.36 2.91
28 SRR953432.4573 84 968 1054 3′ 39.93 0.974 0.07 0.31 0.38 2.91
#
# CPU time: 4.63u 2.63s 00:00:07.26 Elapsed: 00:00:01.58
# Saving alignment to file /usr/local/install/temp/paprica-paprica_v0.4.0a/test.bacteria.combined_16S.bacteria.tax.clean.align.sto … done
#
# CPU time: 0.11u 0.03s 00:00:00.14 Elapsed: 00:00:00.15
Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /usr/local/install/temp/paprica-paprica_v0.4.0a/test.bacteria.combined_16S.bacteria.tax.clean.align.fasta…
Found reference sequences in given alignment file. Using those for reference alignment.
Pre-masking sequences… sequence length cut from 2630 to 84.
Note: you have 84 sites after pre-masking. That means there is rather little information in these sequences for placement.
Determining figs… figs disabled.
Allocating memory for internal nodes… done.
Caching likelihood information on reference tree… done.
Pulling exponents… done.
Preparing the edges for baseball… done.
working on SRR953432.4573 (1/1)…
Traceback (most recent call last):
File “/usr/local/bin/paprica-tally_pathways.py”, line 117, in
genome_data = pd.DataFrame.from_csv(ref_dir_domain + ‘genome_data.final.csv’, header = 0, index_col = 0)
File “/usr/lib64/python2.7/site-packages/pandas/core/frame.py”, line 1177, in from_csv
infer_datetime_format=infer_datetime_format)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 590, in __init__
self._make_engine(self.engine)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File “/usr/lib64/python2.7/site-packages/pandas/io/parsers.py”, line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File “pandas/parser.pyx”, line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File “pandas/parser.pyx”, line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /usr/local/bin/ref_genome_database/bacteria/genome_data.final.csv does not exist
I’m replying to you via email to keep the comment thread clean. You seem to have paprica installed in a very odd location which could be part of the problem. I recommend cloning the repository to your home directory. Try that and let me know if the problem persists.
Dear Jeff,
May I ask you, how does Paprica handle 16S rRNA copy number?
I can read in the text “… number of 16S rRNA gene copies associated with each 16S rRNA gene …“ does it mean that 16S rRNA gene copies are taken into account when computing abundance of pathways/enzymes or should it be corrected explicitly/subsequently by user/me?
I would like to thank you for a great piece of software, it is wonderful work.
Roman, great question, I should make that more clear in the documentation. The copy number correction is applied automatically to enzymes and pathways. You will notice that the total number for each enzyme or pathway is often not an integer as a result.
Thanks Jeff, it is great …
Aneesha,
Thanks for giving paprica a try. You’ll receive an email from me shortly with some troubleshooting instructions.
-Jeff
I have cloned the genome_finder repositiory from git and am trying to run paprica_run.sh on a Fasta file of representative OTU sequences from human microbiome, with the following error. Please let me know a work through.
MacQIIME*****-MacBook-Air:genome_finder $ paprica_run.sh /Users/*****/Desktop/Saliva_rep_set
cat: /Users/****/genome_finder/genome_finder/ref_genome_database/combined_16S.tax.refpkg/combined_16S.tax.clean.align.fasta: No such file or directory
/bin/bash: seqmagick: command not found
Failed to parse command line: No such option “–outformat”.
Usage: cmalign [-options]
Usage: cmalign [-options] –merge
The –merge option merges the two alignments in and
created by previous runs of cmalign with into a single alignment.
To see more help on available options, do cmalign -h
/bin/bash: seqmagick: command not found
/bin/bash: seqmagick: command not found
Uncaught exception: Sys_error(“/Users/*****/genome_finder/genome_finder/ref_genome_database/combined_16S.tax.refpkg: No such file or directory”)
Running pplacer v1.1.alpha13r2-0-g79a8847 analysis on /Users/*****/Desktop/Saliva_rep_set.combined_16S.tax.clean.align.fasta…
Fatal error: exception Sys_error(“/*****/aneesha/genome_finder/genome_finder/ref_genome_database/combined_16S.tax.refpkg: No such file or directory”)
Uncaught exception: Sys_error(“/Users/*****/Desktop/Saliva_rep_set.combined_16S.tax.clean.align.jplace: No such file or directory”)
Fatal error: exception Sys_error(“/Users/*****/Desktop/Saliva_rep_set.combined_16S.tax.clean.align.jplace: No such file or directory”)
Uncaught exception: Sys_error(“/Users/*****/Desktop/Saliva_rep_set.combined_16S.tax.clean.align.jplace: No such file or directory”)
Fatal error: exception Sys_error(“/Users/*****/Desktop/Saliva_rep_set.combined_16S.tax.clean.align.jplace: No such file or directory”)
Traceback (most recent call last):
File “paprica_tally_pathways_v0.20.py”, line 49, in
genome_data = pd.DataFrame.from_csv(variables[‘ref_dir’] + ‘genome_data.final.csv’, header = 0, index_col = 0)
File “/macqiime/anaconda/lib/python2.7/site-packages/pandas/core/frame.py”, line 1036, in from_csv
infer_datetime_format=infer_datetime_format)
File “/macqiime/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py”, line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File “/macqiime/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py”, line 250, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File “/macqiime/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py”, line 566, in __init__
self._make_engine(self.engine)
File “/macqiime/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py”, line 705, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File “/macqiime/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py”, line 1072, in __init__
self._reader = _parser.TextReader(src, **kwds)
File “pandas/parser.pyx”, line 350, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)
File “pandas/parser.pyx”, line 594, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5912)
IOError: File /Users/*****/genome_finder/genome_finder/ref_genome_database/genome_data.final.csv does not exist