This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.
Getting Started
I’ve been making a lot of improvements to paprica, our program for conducting metabolic inference on 16S rRNA gene sequence libraries. The following is a complete analysis example with paprica to clarify the steps described in the wiki, and to highlight some of the recent improvements to the method. I’ll continue to update this tutorial as the method evolves. This tutorial assumes that you have all the dependencies for paprica-run.sh installed and in your PATH. If you’re a Mac user you can follow the instructions here. Linux (including Windows Subsystem for Linux) users should use this script as a guide. This tutorial has been re-written for the most recent version of paprica, and does not directly apply to v0.6 or earlier.
Although not required for this tutorial, I recommend that you have R installed (probably with RStudio) for downstream analysis, and the Archaeopteryx tree viewer. Follow the appropriate instructions for your platform for R and RStudio. Archaeopteryx is a little more convoluted; after first installing Java, install Archaeopteryx as such:
## Download the jar file, note that you might need to update this link to reflect the current version. Visit https://sites.google.com/site/cmzmasek/home/software/archaeopteryx to check. wget http://www.phyloxml.org/download/forester/forester_1050.jar mv forester_1050.jar archaeopteryx.jar ## Download the configuration file and rename to something that works wget http://www.phyloxml.org/download/forester/archaeopteryx/_aptx_configuration_file.txt mv _aptx_configuration_file.txt aptx_configuration_file
Double click on archaeopteryx.jar to start the program. Finally, this tutorial assumes that you are using the provided database of metabolic pathways and genome data included in the ref_genome_database directory. If you want to build a custom database you should follow this tutorial. All the dependencies are installed and tested? Before we start let’s get familiar with some terminology.
closest completed genome (CCG): One type of edge: the most closely taxonomically related completed genome to a query read. This term is only applicable for query reads that place to a terminal edge (branch tip) on the reference tree.
closest estimated genome (CEG): Another type of edge: the set of genes that are estimated to be found in all members of a clade originating at a branch point in the reference tree. This term is only applicable to query reads that place to an internal edge on the reference tree. The CEG is the point of placement.
community structure: The taxonomic structure of a bacterial assemblage.
edge: An edge is a point of placement on a reference tree. Think of it as a branch of the reference tree. Edges take the place of OTUs in this workflow, and are ultimately far more interesting and informative than OTUs. Refer to the pplacer documentation for more.
unique read: A read from a dataset that has been denoised using (e.g.) dada2.
metabolic structure: The abundance of different metabolic pathways within a bacterial assemblage.
reference tree: This is the tree of representative 16S rRNA gene sequences from all completed Bacterial genomes in Genbank. The topology of the tree defines what pathways are predicted for internal branch points.
Overview
The basic steps that paprica takes during analysis are:
- Identify the reads that belong to the specified domain (bacteria, archaea, eukarya).
- Place the reads on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes (Archaea) or representatives from all phyla present in RefSeq genomes (Bacteria). For bacteria, reads that place to a phylum (i.e. not an internal node) are then placed on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes belonging to that phylum. The domain Eukarya follows a similar procedure as for Bacteria, except that an 18S rRNA gene tree is used and division is used rather than phylum. The 18S rRNA gene sequences come from the PR2 database rather than RefSeq.
- For Bacteria and Archaea only, map the enzymes, pathways, and genome parameters present at each point of placement to the query reads.
- Use the points of placement to provide a consensus taxonomy for each query read.
Testing the Installation
You can test your installation of paprica and its dependencies using the provided test.bacteria.fasta or test.archaea.fasta files. For test.bacteria.fasta, from the paprica directory:
./paprica-run.sh test bacteria
The first argument specifies the name of the input fasta file (test.bacteria.fasta) without the extension. The second argument specified which domain you are analyzing for. Executing the command produces a variety of output files in the paprica directory:
ls test* temp.bacteria test.archaea.fasta test.archaeal16S.reads.txt test.bacteria.clean.fasta test.bacteria.clean.unique.align.sto test.bacteria.clean.unique.count test.bacteria.clean.unique.fasta test.bacteria.combined_16S.bacteria.tax.placements.csv test.bacteria.ec.csv test.bacteria.edge_data.csv test.bacteria.fasta test.bacteria.pathways.csv test.bacteria.sample_data.txt test.bacteria.sum_ec.csv test.bacteria.sum_pathways.csv test.bacteria.unique_seqs.csv test.bacterial16S.reads.txt test.eukarya.fasta test.eukaryote18S.reads.txt test.fasta test.unique.count test.unique.fasta
Each sample fasta file that you run will produce similar output. These files are described in detail in the Wiki here, with the following being particularly useful to you:
test.bacteria: Because multiple jplace and other intermediate files are created for the domains Bacteria and Eukarya the number of files quickly gets out of control. Some of these files are useful for downstream analysis and troubleshooting, however, so they are retained in a directory with the sample name.
test.bacteria.combined_16S.bacteria.tax.placements.csv: The *placements.csv file represents a summary of the placement data for each unique query read and is the result of parsing the jplace file(s) produced by epa-ng. For Bacteria and Eukarya, which have multiple reference trees, this file contains the phylogenetic placement results for all the reference trees for a given sample.
test.bacteria.bacteria.edge_data.csv: This is a csv format file containing data on edge location in the reference tree that received a placement, such as the number of reads that placed, predicted 16S rRNA gene copies, number of reads placed normalized to 16S rRNA gene copies, GC content, etc. This file describes the taxonomic structure of your sample.
test.bacteria.unique_seqs.csv: This is a csv file that contains the abundance and normalized abundance of unique sequences. Each sequence is identified by a unique hash
(to allow tallying across multiple samples) and the name of a representative read is also provided.
test.bacteria.bacteria.sum_pathways.csv: This is a csv file of all the metabolic pathways inferred for test.bacteria.fasta, by edge. All possible metabolic pathways are listed (meaning all metabolic pathways that are represented in the database), the number attributed to each edge is given in the column for that edge.
test.bacteria.bacteria.sum_ec.csv: This is a csv file of all the enzymes with EC numbers inferred for test.bacteria.fasta, by edge. The file is structured the same as test.bacteria.bacteria.sum_pathways.csv.
test.bacteria.bacteria.sample_data.txt: This file described some basic information for the sample, such as the database version that was used to make the metabolic inference, the confidence score, total reads used, etc.
test.bacteria.bacteria.sum_pathways.csv: This csv format file describes the metabolic structure of the sample, i.e. pathway abundance across all edges.
If you want to run an analysis with archaeal 16S rRNA or eukaryotic 18S rRNA gene reads you can test the same way. Note that there is no metabolic inference for the domain eukarya, but you the reads are placed on an extensive set of reference trees that are useful for classification and phylogenetic analysis.
./paprica-run.sh test archaea ./paprica-run.sh test eukarya
Conducting an Analysis – read QC
Okay, that was all well and good for the test.fasta file, which has placements only for a single edge and is not particularly exciting. Let’s try something a bit more advanced that mimics what you would do for an actual analysis. First we need to get some data. In this case we’ll use a set of samples from a recent publication, Webb et al. 2019. You can acquire these data from the NCBI SRA site using prefetch. First, create a file titled SRA_select_Run.txt, then populate the file with these accession numbers:
SRX4496910 SRX4496911 SRX4496912 SRX4496913 SRX4496914 SRX4496915 SRX4496916 SRX4496917 SRX4496918 SRX4496919 SRX4496920 SRX4496921 SRX4496922 SRX4496923 SRX4496924 SRX4496925 SRX4496926 SRX4496927 SRX4496928 SRX4496929 SRX4496930 SRX4496931 SRX4496932 SRX4496933 SRX4496934 SRX4496935 SRX4496936 SRX4496937 SRX4496938 SRX4496939 SRX4496940 SRX4496941 SRX4496942 SRX4496943 SRX4496944 SRX4496945 SRX4496946 SRX4496947 SRX4496948 SRX4496949 SRX4496950 SRX4496951 SRX4496952 SRX4496953 SRX4496954
Then execute the following command (if you don’t have it already, install Gnu parallel with your package manger of choice):
parallel fastq-dump --split-files --skip-technical {} < SRA_select_Run.txt
Now you’ll want to QC, denoise, and merge the reads. I’m not going to cover that here, but strongly recommend using dada2. You can use this R script, which is a modification of the tutorial on the dada2 site. Merged reads should be inflated to redundant fasta files before analysis with paprica. You can use our deunique_dada2.py script for that.
Conducting an Analysis – running paprica
Previously you executed paprica on just a test file (test.fasta). Here we have a larger number of samples, and we need to construct a loop to run paprica on all of them. First, copy the paprica/paprica-run.sh file into your working directory:
cp ~/paprica/paprica-run.sh paprica-run.sh
Paprica is pretty fast, and the key parts are already parallelized, so it doesn’t make sense to run the samples in parallel. We can loop across multiple input files like this, specifying the file to run without the “.fasta” extension:
for f in *exp.fasta;do NAME=$(basename $f .fasta); ./paprica-run.sh $NAME bacteria; done
At this point you have individual analysis files for all your samples. These can be quite useful, but most useful are classic abundance tables for edges, unique reads, enzymes, and pathways. You can create these using the paprica-combine_results.py script. This used to be a separate utility but is now a core part of paprica. That means the script should reside in your paprica directory, and you call it from your working directory:
paprica-combine_results.py -domain bacteria -o 2017.07.03_seagrass
This produces several files, note that the prefix in the file names is set by the -o flag in the command above, and reflects the nature of these example reads:
2017.07.03_seagrass_bacteria.taxon_map.txt: This file maps edge numbers to the name of the lowest consensus taxonomy (for CEGs) or strain name (for CCGs).
2017.07.03_seagrass_bacteria.seq_edge_map.txt: This file maps unique sequences to the edge it most frequently places to across samples. The vast majority of reads will place to only a single sample, however, the more samples you have and the more abundant a read is, the more likely it is that you’ll place to more than one edge.
2017.07.03_seagrass_bacteria.edge_data.csv: These are the mean genome parameters for each sample. Lots of good stuff in here, see the column labels.
2017.07.03_seagrass_bacteria.edge_tally.csv: Edge abundance for each sample (corrected for 16S rRNA gene copy). This is your community structure, and is equivalent to an OTU table (but much better!).
2017.07.03_seagrass_bacteria.unique_tally.csv: The abundance and normalized abundance of unique sequences.
2017.07.03_seagrass_bacteria.ec_tally.csv: Enzyme abundance for each sample.
2017.07.03_seagrass_bacteria.path_tally.csv: Metabolic pathway abundance for each sample.
To get familiar with some basic operations on the output files, including exploring the data with heatmaps and correspondence analysis, you can continue with this tutorial.
Hi Jeff,
Great program, and I’m excited to see my results. I successfully ran my own samples, but when I try to run the combine_results code, I end up with an error — specifically, it flags a nan value and throws a KeyError, which should be handled by the exception line in the code (see the copied source code below, line 256). However, it throws another error on line 248: “ValueError: cannot reindex on an axis with duplicate labels.” Any ideas on what could be going wrong? I didn’t want to try debug the source code without seeing if you had any ideas first. Thanks!
try:
unique_edge_abund.loc[seq, temp_unique.loc[seq, ‘global_edge_num’]] += temp_unique.loc[seq, ‘abundance_corrected’]
except KeyError:
unique_edge_abund.loc[seq, temp_unique.loc[seq, ‘global_edge_num’]] = temp_unique.loc[seq, ‘abundance_corrected’]
Dear Jeff,
thank you very much for Paprica, it is a great software.
May I ask you, I was wondering if there is an easy (semi/automated) way how to group identified pathways into a higher level. I mean in your paper Microbial Communities Can Be Described by Metabolic Structure: A General Framework and Application to a Seasonally Variable, Depth-Stratified Microbial Community from the Coastal West Antarctic Peninsula you have a nice figure 6. Metabolic pathways differentially present between summer surface samples and winter and deep samples. where you did group pathways into higher taxonomic level – e.g. Anaerobic metabolism, Nitrogen degradation etc.
Thank you very much.
Roman
Roman,
I just replied on GitHub but including comment here. We don’t have a built-in way to do this as it gets a bit arbitrary; pathways are typically involved in multiple higher-order processes. The best thing to do is to explore BioCyc to create a hierarchy that makes the most sense for your data and project.
Jeff
Hi Jeff,
Can you suggest how do i assign names to each E.C number? Its taxing to go through each number in enzyme database. I am sure there is a method, but I am fairly new to bioinformatics. Any suggestion will be very helpful.
Thank you
This discussion ended up taking place on GitHub, you can check it out here: https://github.com/bowmanjeffs/paprica/issues/52. The next time I’m inside paprica I’ll try to tweak things so that the output contains enzyme names, as this is clearly useful!
Hi Jeff, i have used the pathways result to build heatmap and i want to do analysis using the enzyme data, can you please tell how to assign enzyme names to the E.C numbers, there is a huge data and individually looking into each number is very time consuming.
Thank you,
Garima Raj
I’ve made some long overdue updates to this tutorial to get it inline with the latest paprica features and command structure. If you’re working through it and you run into any problems please let me know!
Thanks for paprica, it is great piece of software I was looking for.
Btw. I believe that the command for aggregating results is not
python combine_edge_results.py my_analysis
but rather
python combine_edge_results.py my_analysis -edge_in [suffix pattern for edges] -path_in [suffix pattern for paths] -ec_in [suffix pattern for ec numbers] -o [prefix for output]
e.g.
python combine_edge_results.py my_analysis -edge_in bacteria.edge_data.csv -path_in bacteria.sum_pathways.csv -ec_in bacteria.sum_ec.csv -o my_analysis
You are correct! The tutorial is now lagging a few releases behind… I need to get it up to date!
when I test the installation using command in virtualbox “./paprica-run.sh test.bacteria bacteria”,
error come out below:
./paprica-run.sh: line 18: paprica-place_it.py: command not found
./paprica-run.sh: line 23: paprica-tally_pathways.py: command not found
Would you tell me how to fix this problem, many thanks!
Hmmm… I can’t reproduce this problem in the Virtualbox appliance. Can you confirm that you are using that VB appliance exactly as it came, i.e. that you didn’t attempt to carry out the installation steps within the VB appliance? Assuming the indicated Python scripts are located in the paprica directory this problem is usually caused by an issue with the shell script startup files.