Tutorial: Analysis with paprica

paprika

Paprika. Not to be confused with paprica.

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Getting Started

I’ve been making a lot of improvements to paprica, our program for conducting metabolic inference on 16S rRNA gene sequence libraries.  The following is a complete analysis example with paprica to clarify the steps described in the wiki, and to highlight some of the recent improvements to the method.  I’ll continue to update this tutorial as the method evolves.  This tutorial assumes that you have all the dependencies for paprica-run.sh installed and in your PATH.  If you’re a Mac user you can follow the instructions here.  Linux (including Windows Subsystem for Linux) users should use this script as a guide.  This tutorial has been re-written for the most recent version of paprica, and does not directly apply to v0.6 or earlier.

Although not required for this tutorial, I recommend that you have R installed (probably with RStudio) for downstream analysis, and the Archaeopteryx tree viewer.  Follow the appropriate instructions for your platform for R and RStudio.  Archaeopteryx is a little more convoluted; after first installing Java, install Archaeopteryx as such:

## Download the jar file, note that you might need to update this link to reflect the current version.  Visit https://sites.google.com/site/cmzmasek/home/software/archaeopteryx to check.
wget http://www.phyloxml.org/download/forester/forester_1050.jar
mv forester_1050.jar archaeopteryx.jar
## Download the configuration file and rename to something that works
wget http://www.phyloxml.org/download/forester/archaeopteryx/_aptx_configuration_file.txt
mv _aptx_configuration_file.txt aptx_configuration_file

Double click on archaeopteryx.jar to start the program.  Finally, this tutorial assumes that you are using the provided database of metabolic pathways and genome data included in the ref_genome_database directory.  If you want to build a custom database you should follow this tutorial.  All the dependencies are installed and tested?  Before we start let’s get familiar with some terminology.

closest completed genome (CCG): One type of edge: the most closely taxonomically related completed genome to a query read.  This term is only applicable for query reads that place to a terminal edge (branch tip) on the reference tree.

closest estimated genome (CEG): Another type of edge: the set of genes that are estimated to be found  in all members of a clade originating at a branch point in the reference tree.  This term is only applicable to query reads that place to an internal edge on the reference tree.  The CEG is the point of placement.

community structure: The taxonomic structure of a bacterial assemblage.

edge: An edge is a point of placement on a reference tree.  Think of it as a branch of the reference tree.  Edges take the place of OTUs in this workflow, and are ultimately far more interesting and informative than OTUs.  Refer to the pplacer documentation for more.

unique read: A read from a dataset that has been denoised using (e.g.) dada2.

metabolic structure: The abundance of different metabolic pathways within a bacterial assemblage.

reference tree: This is the tree of representative 16S rRNA gene sequences from all completed Bacterial genomes in Genbank.  The topology of the tree defines what pathways are predicted for internal branch points.

Overview

The basic steps that paprica takes during analysis are:

  1. Identify the reads that belong to the specified domain (bacteria, archaea, eukarya).
  2. Place the reads on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes (Archaea) or representatives from all phyla present in RefSeq genomes (Bacteria).  For bacteria, reads that place to a phylum (i.e. not an internal node) are then placed on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes belonging to that phylum.  The domain Eukarya follows a similar procedure as for Bacteria, except that an 18S rRNA gene tree is used and division is used rather than phylum.  The 18S rRNA gene sequences come from the PR2 database rather than RefSeq.
  3. For Bacteria and Archaea only, map the enzymes, pathways, and genome parameters present at each point of placement to the query reads.
  4. Use the points of placement to provide a consensus taxonomy for each query read.

Testing the Installation

You can test your installation of paprica and its dependencies using the provided test.bacteria.fasta or test.archaea.fasta files.  For test.bacteria.fasta, from the paprica directory:

./paprica-run.sh test bacteria

The first argument specifies the name of the input fasta file (test.bacteria.fasta) without the extension.  The second argument specified which domain you are analyzing for.  Executing the command produces a variety of output files in the paprica directory:

ls test*
temp.bacteria
test.archaea.fasta
test.archaeal16S.reads.txt
test.bacteria.clean.fasta
test.bacteria.clean.unique.align.sto
test.bacteria.clean.unique.count
test.bacteria.clean.unique.fasta
test.bacteria.combined_16S.bacteria.tax.placements.csv
test.bacteria.ec.csv
test.bacteria.edge_data.csv
test.bacteria.fasta
test.bacteria.pathways.csv
test.bacteria.sample_data.txt
test.bacteria.sum_ec.csv
test.bacteria.sum_pathways.csv
test.bacteria.unique_seqs.csv
test.bacterial16S.reads.txt
test.eukarya.fasta
test.eukaryote18S.reads.txt
test.fasta
test.unique.count
test.unique.fasta

Each sample fasta file that you run will produce similar output.  These files are described in detail in the Wiki here, with the following being particularly useful to you:

test.bacteria: Because multiple jplace and other intermediate files are created for the domains Bacteria and Eukarya the number of files quickly gets out of control.  Some of these files are useful for downstream analysis and troubleshooting, however, so they are retained in a directory with the sample name.

test.bacteria.combined_16S.bacteria.tax.placements.csv: The *placements.csv file represents a summary of the placement data for each unique query read and is the result of parsing the jplace file(s) produced by epa-ng.  For Bacteria and Eukarya, which have multiple reference trees, this file contains the phylogenetic placement results for all the reference trees for a given sample.

test.bacteria.bacteria.edge_data.csv: This is a csv format file containing data on edge location in the reference tree that received a placement, such as the number of reads that placed, predicted 16S rRNA gene copies, number of reads placed normalized to 16S rRNA gene copies, GC content, etc.  This file describes the taxonomic structure of your sample.

test.bacteria.unique_seqs.csv: This is a csv file that contains the abundance and normalized abundance of unique sequences. Each sequence is identified by a unique hash
(to allow tallying across multiple samples) and the name of a representative read is also provided.

test.bacteria.bacteria.sum_pathways.csv: This is a csv file of all the metabolic pathways inferred for test.bacteria.fasta, by edge.  All possible metabolic pathways are listed (meaning all metabolic pathways that are represented in the database), the number attributed to each edge is given in the column for that edge.

test.bacteria.bacteria.sum_ec.csv: This is a csv file of all the enzymes with EC numbers inferred for test.bacteria.fasta, by edge. The file is structured the same as test.bacteria.bacteria.sum_pathways.csv.

test.bacteria.bacteria.sample_data.txt: This file described some basic information for the sample, such as the database version that was used to make the metabolic inference, the confidence score, total reads used, etc.

test.bacteria.bacteria.sum_pathways.csv: This csv format file describes the metabolic structure of the sample, i.e. pathway abundance across all edges.

If you want to run an analysis with archaeal 16S rRNA or eukaryotic 18S rRNA gene reads you can test the same way.  Note that there is no metabolic inference for the domain eukarya, but you the reads are placed on an extensive set of reference trees that are useful for classification and phylogenetic analysis.

./paprica-run.sh test archaea
./paprica-run.sh test eukarya

Conducting an Analysis – read QC

Okay, that was all well and good for the test.fasta file, which has placements only for a single edge and is not particularly exciting.  Let’s try something a bit more advanced that mimics what you would do for an actual analysis.  First we need to get some data.  In this case we’ll use a set of samples from a recent publication, Webb et al. 2019.  You can acquire these data from the NCBI SRA site using prefetch.  First, create a file titled SRA_select_Run.txt, then populate the file with these accession numbers:

SRX4496910
SRX4496911
SRX4496912
SRX4496913
SRX4496914
SRX4496915
SRX4496916
SRX4496917
SRX4496918
SRX4496919
SRX4496920
SRX4496921
SRX4496922
SRX4496923
SRX4496924
SRX4496925
SRX4496926
SRX4496927
SRX4496928
SRX4496929
SRX4496930
SRX4496931
SRX4496932
SRX4496933
SRX4496934
SRX4496935
SRX4496936
SRX4496937
SRX4496938
SRX4496939
SRX4496940
SRX4496941
SRX4496942
SRX4496943
SRX4496944
SRX4496945
SRX4496946
SRX4496947
SRX4496948
SRX4496949
SRX4496950
SRX4496951
SRX4496952
SRX4496953
SRX4496954

Then execute the following command (if you don’t have it already, install Gnu parallel with your package manger of choice):

parallel fastq-dump --split-files --skip-technical {} < SRA_select_Run.txt

Now you’ll want to QC, denoise, and merge the reads.  I’m not going to cover that here, but strongly recommend using dada2.  You can use this R script, which is a modification of the tutorial on the dada2 site.  Merged reads should be inflated to redundant fasta files before analysis with paprica.  You can use our deunique_dada2.py script for that.

Conducting an Analysis – running paprica

Previously you executed paprica on just a test file (test.fasta).  Here we have a larger number of samples, and we need to construct a loop to run paprica on all of them.  First, copy the paprica/paprica-run.sh file into your working directory:

cp ~/paprica/paprica-run.sh paprica-run.sh

Paprica is pretty fast, and the key parts are already parallelized, so it doesn’t make sense to run the samples in parallel.  We can loop across multiple input files like this, specifying the file to run without the “.fasta” extension:

for f in *exp.fasta;do NAME=$(basename $f .fasta); ./paprica-run.sh $NAME bacteria; done

At this point you have individual analysis files for all your samples.  These can be quite useful, but most useful are classic abundance tables for edges, unique reads, enzymes, and pathways.  You can create these using the paprica-combine_results.py script.  This used to be a separate utility but is now a core part of paprica.  That means the script should reside in your paprica directory, and you call it from your working directory:

paprica-combine_results.py -domain bacteria -o 2017.07.03_seagrass

This produces several files, note that the prefix in the file names is set by the -o flag in the command above, and reflects the nature of these example reads:

2017.07.03_seagrass_bacteria.taxon_map.txt: This file maps edge numbers to the name of the lowest consensus taxonomy (for CEGs) or strain name (for CCGs).

2017.07.03_seagrass_bacteria.seq_edge_map.txt: This file maps unique sequences to the edge it most frequently places to across samples.  The vast majority of reads will place to only a single sample, however, the more samples you have and the more abundant a read is, the more likely it is that you’ll place to more than one edge.

2017.07.03_seagrass_bacteria.edge_data.csv: These are the mean genome parameters for each sample.  Lots of good stuff in here, see the column labels.

2017.07.03_seagrass_bacteria.edge_tally.csv: Edge abundance for each sample (corrected for 16S rRNA gene copy).  This is your community structure, and is equivalent to an OTU table (but much better!).

2017.07.03_seagrass_bacteria.unique_tally.csv: The abundance and normalized abundance of unique sequences.

2017.07.03_seagrass_bacteria.ec_tally.csv: Enzyme abundance for each sample.

2017.07.03_seagrass_bacteria.path_tally.csv: Metabolic pathway abundance for each sample.

To get familiar with some basic operations on the output files, including exploring the data with heatmaps and correspondence analysis, you can continue with this tutorial.

79897 Total Views 6 Views Today
This entry was posted in paprica. Bookmark the permalink.

11 Responses to Tutorial: Analysis with paprica

  1. Nathan says:

    Hi Jeff,

    Great program, and I’m excited to see my results. I successfully ran my own samples, but when I try to run the combine_results code, I end up with an error — specifically, it flags a nan value and throws a KeyError, which should be handled by the exception line in the code (see the copied source code below, line 256). However, it throws another error on line 248: “ValueError: cannot reindex on an axis with duplicate labels.” Any ideas on what could be going wrong? I didn’t want to try debug the source code without seeing if you had any ideas first. Thanks!

    try:
    unique_edge_abund.loc[seq, temp_unique.loc[seq, ‘global_edge_num’]] += temp_unique.loc[seq, ‘abundance_corrected’]
    except KeyError:
    unique_edge_abund.loc[seq, temp_unique.loc[seq, ‘global_edge_num’]] = temp_unique.loc[seq, ‘abundance_corrected’]

  2. Roman says:

    Dear Jeff,
    thank you very much for Paprica, it is a great software.
    May I ask you, I was wondering if there is an easy (semi/automated) way how to group identified pathways into a higher level. I mean in your paper Microbial Communities Can Be Described by Metabolic Structure: A General Framework and Application to a Seasonally Variable, Depth-Stratified Microbial Community from the Coastal West Antarctic Peninsula you have a nice figure 6. Metabolic pathways differentially present between summer surface samples and winter and deep samples. where you did group pathways into higher taxonomic level – e.g. Anaerobic metabolism, Nitrogen degradation etc.

    Thank you very much.

    Roman

    • Avatar photo Jeff says:

      Roman,
      I just replied on GitHub but including comment here. We don’t have a built-in way to do this as it gets a bit arbitrary; pathways are typically involved in multiple higher-order processes. The best thing to do is to explore BioCyc to create a hierarchy that makes the most sense for your data and project.

      Jeff

  3. Garima Raj says:

    Hi Jeff,
    Can you suggest how do i assign names to each E.C number? Its taxing to go through each number in enzyme database. I am sure there is a method, but I am fairly new to bioinformatics. Any suggestion will be very helpful.
    Thank you

  4. garima raj says:

    Hi Jeff, i have used the pathways result to build heatmap and i want to do analysis using the enzyme data, can you please tell how to assign enzyme names to the E.C numbers, there is a huge data and individually looking into each number is very time consuming.
    Thank you,
    Garima Raj

  5. Avatar photo Jeff says:

    I’ve made some long overdue updates to this tutorial to get it inline with the latest paprica features and command structure. If you’re working through it and you run into any problems please let me know!

  6. Roman says:

    Thanks for paprica, it is great piece of software I was looking for.

    Btw. I believe that the command for aggregating results is not
    python combine_edge_results.py my_analysis
    but rather
    python combine_edge_results.py my_analysis -edge_in [suffix pattern for edges] -path_in [suffix pattern for paths] -ec_in [suffix pattern for ec numbers] -o [prefix for output]
    e.g.
    python combine_edge_results.py my_analysis -edge_in bacteria.edge_data.csv -path_in bacteria.sum_pathways.csv -ec_in bacteria.sum_ec.csv -o my_analysis

  7. Zhijian Jiang says:

    when I test the installation using command in virtualbox “./paprica-run.sh test.bacteria bacteria”,
    error come out below:
    ./paprica-run.sh: line 18: paprica-place_it.py: command not found
    ./paprica-run.sh: line 23: paprica-tally_pathways.py: command not found

    Would you tell me how to fix this problem, many thanks!

    • Avatar photo Jeff says:

      Hmmm… I can’t reproduce this problem in the Virtualbox appliance. Can you confirm that you are using that VB appliance exactly as it came, i.e. that you didn’t attempt to carry out the installation steps within the VB appliance? Assuming the indicated Python scripts are located in the paprica directory this problem is usually caused by an issue with the shell script startup files.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield