Phylogenetic placement re-re-visited

I use phylogenetic placement, namely the program pplacer, in a lot of my publications.  It is also a core part of of the paprica metabolic inference pipeline.  As a result I field a lot questions from people trying to integrate pplacer into their own workflows.  Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic, the three programs you need to work with to do phylogenetic placement from start to finish (see also EPA), there is still a steep learning curve for new users.  In the hope of bringing the angle of that curve down a notch or two, and updating my previous posts on the subject (here and here), here is a complete, start to finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al.  To follow along with the tutorial start by downloading the sequences here.

You can use any number of alignment and tree building programs to create a reference tree for phylogenetic placement.  I strongly recommend using RAxML and Infernal.  After a lot of experimentation this combination seems to be produce the most correct topologies and best supported trees.  You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level let alone life-level analyses, but nothing in life is perfect.  This is double the case for this tutorial… I haven’t come across a covariance model for the 16S/18S gene across domains (if you have please post a comment!), so we will use the covariance model for the domain Bacteria.  You can find that at the Rfam database here.  For the purpose of this tutorial the cm has been renamed “16S_bacteria.cm”.  While you’re installing software I also recommend the excellent utility Seqmagick.

Although this tutorial was developed specifically for the 16S rRNA gene, the method is easily adapted to any DNA or protein sequence.  Simply swap out the cmalign command in Infernal for the hmmalign command in HMMER3.0, and download (or create) an appropriate HMM.  You may also need to swap out the model used by RAxML.

The workflow will follow these steps:

  1. Create an alignment of the reference sequences with Infernal
  2. Create a phylogenetic tree of the alignment
  3. Create a reference package from the alignment, tree, and stats file
  4. Proceed with the phylogenetic placement of your query reads

Create an alignment of the reference sequences

The very first thing that you need to do is clean your sequence names of any wonky punctuation.  This is something that trips up almost everyone.  You should really have no punctuation in the names beyond “_”, and no spaces!

tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < hug_tol.fasta > hug_tol.clean.fasta

Next create an alignment from the cleaned file.  I always like to degap first, although it shouldn’t matter.

## Degap
seqmagick mogrify --ungap hug_tol.clean.fasta

## Align
cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam 16S_bacteria.cm hug_tol.clean.fasta

## Convert to fasta format
seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta

If any sequences are identical after alignment this will mess things up (you should never build a tree from a non-redundant alignment), so make non-redundant.

## Deduplicate sequences
seqmagick mogrify --deduplicate-sequences hug_tol.clean.align.fasta

Build the reference tree

At this point you should have a nice clean DNA alignment in the fasta format.  Now feed it to RAxML to build a tree.  Depending on the size of the alignment this can take a little bit.  I know you’ve read the entire RAxML manual, so of course you are already aware that adding additional cpus won’t help…

## Build tree without confidence scores, just to get stats for Taxtastic
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345

I like having a sensibly rooted tree; it’s just more pleasing to look at.  You can do this manually, or you can have RAxML try to root the tree for you.

## Root the tree
raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre

Okay, now comes the tricky bit.  Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference tree won’t be able to read the RAxML stats file if it includes confidence values.  The work around is to add the confidence scores to the already generated (rooted) tree, so that you a version of the tree with out without scores.  You will feed the scored tree to Taxtastic with the stats file from the unscored tree we already generated.

## Generate confidence scores for tree
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta

Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.

taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre

Align the query reads

At this point you have the reference package and you can proceed with analyzing some query reads!  The first step is to align the query reads after combining them with the de-duplicated reference alignment*.

## Clean the names
tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < query.fasta > query.clean.fasta

## Concatenate the query reads and reference alignment
cat hug_tol.clean.align.fasta query.clean.fasta > query.hug_tol.clean.fasta

## Remove gaps
seqmagick mogrify --ungap query.hug_tol.clean.fasta

## Align
cmalign --dna -o query.hug_tol.clean.align.sto --outformat Pfam 16S_bacteria.cm query.hug_tol.clean.fasta

## Convert to fasta
seqmagick convert query.hug_tol.clean.align.sto query.hug_tol.clean.align.fasta

Phylogenetic placement

Now we’re on the home stretch, we can execute the phylogenetic placement itself!  The flags are important here, so it’s worth checking the pplacer documentation to insure that your goals are consistent with mine (get a job, publish some papers?).  You can probably accept most of the flags for the previous commands as is.

pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta

At this point you have a file named query.hug_tol.clean.align.jplace.  You will need to use guppy to convert this json-format file to information that is readable by human.  The two most useful guppy commands (in my experience) for a basic look at your data are:

## Generate an easily parsed csv file of placements, with only a single placement reported for each
## query read.
guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace

## Generate a phyloxml tree with edges fattened according to the number of placements.
guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace

*A far more elegant solution would be to use esl-alimerge included with Infernal to merge the query and reference alignment, after aligning the query reads against 16S_bacteria.cm.  The problem with this is that esl-alimerge needs the original sto file produced by cmalign, and that file contains duplicate sequences not used to build the reference tree.  Easy enough to de-duplicate this file, but the sto format produced by Seqmagick is not compatible with esl-alimerge 🙁

35438 Total Views 2 Views Today
This entry was posted in Research. Bookmark the permalink.

20 Responses to Phylogenetic placement re-re-visited

  1. Francesco says:

    P.S: I forgot to mention that the posted link for the 16S_bacteria.cm file apparently does no works

  2. Francesco says:

    Hi Jeff, thank you very much for this tutorial, it is amazing. Only, I am quite new with this kind of analysis and I am not sure about the “Pfam 16S_bacteria.cm” file, where can I found and download it? In advance, thank you very much for your help. Fra

  3. Huda says:

    Hi

    This is a great tutorial. I have followed it through and have been reading the pplacer document along side. I am confused about one thing. I view the output from pplacer in iTOL and I cannot find my query sequences in there. The CSV file shows the edge number but how can I search them in the tree. If i search them with their names the search returns zero.

    Please help

    • Francesco Cicala says:

      I have the same problems, any suggestions?
      In advance thank for your help
      Fra

      • Avatar photo Jeff says:

        This is a good question for the phylogenetic placement Google group which is monitored by the pplacer and epa-ng developers. Note that pplacer is not under active development, you should be using epa-ng. I’m not very familiar with iTOL, but it sounds like you might be misinterpreting the use of the fat format trees. You won’t be able to search for individual reads on that tree, but you can parse the jplace file for this information.

  4. Laura says:

    Hello, I have a question. When you want to include taxonomy in your refpkg, is it possible to define a database different from NCBI?. I have checked http://fhcrc.github.io/taxtastic/quickstart.html and there says that I need a copy of the NCBI taxonomy, but if I want to use my own taxonomy. Is that possible?

    thanks in advance

    laura

    • Avatar photo Jeff says:

      I’ve never tried that, so long as it was structured the same way I don’t know why you couldn’t do that. Give it a try with a small test database. You might consider posting your query on the pplacer Google Group to see what Matsen et al. have to say.

  5. Confused says:

    Hi Jeff – Thank you for your informative post! Once you’ve created the tree, how can you determine which query sequences have been placed where (i.e. which query sequences have been placed on a specific edge?)? Thank you!

    • Avatar photo Jeff says:

      The csv file created by guppy_to_csv should tell you just that. Just look for the edge_num column.

      • Confused says:

        Thanks Jeff! I’m noticing that the edge that appears to have the most reads (i.e. the widest) after using guppy fat and visualizing the tree with Archaeopteryx , is an edge that only has one query sequence placed on it according to the guppy_to_csv file (as opposed to other edges which have 2-8 sequences). Do you have any ideas as to why this might be the case, or am I misunderstanding the function of guppy fat? Your help is very much appreciated!

        • Avatar photo Jeff says:

          Nothing much to suggest there, you’re interpreting the meaning of edge width correctly. Seems that something funny is going on…

  6. Cyril says:

    Hi Jeff,
    when I try the command: tr “[ -%,;\(\):=\.\\\*[]\”\’]” “_” hug_tol.clean.fasta

    I get an error message which says: tr: misplaced sequence asterisk

    Any ideas on what to do? I’m working on a mac.

    • Avatar photo Jeff says:

      Double check the command. That differs from the one specified…

      • Cyril says:

        You seem to be using quasi-regex character class syntax for tr. tr is much more limited in it’s scope than that. It only accepts a few escape characters and special characters. I asked for help on stackoverflow and they told me to simplify the command to:

        tr “%,;():=.*[]\”\’ \\\\\-” “_” hug_tol.clean.fasta

        Which worked for me. Might be helpful for other people using a mac.

  7. Jacob says:

    This is a really useful tutorial. I was wondering if there are some example `query.fasta` data available for following-along purposes. I didn’t see any linked here (though maybe I just missed them) and it would help just to make sure everything is working as expected before I run this on my actual data.
    Thanks!

    • Avatar photo Jeff says:

      Glad you found it useful. That’s a pretty old tutorial that is largely replaced by our paprica pipeline. You can find a link to a sample query file from one of the paprica tutorials linked here.

  8. Ramzi says:

    Hi Jeff,
    Thanks for posting this. I am trying to follow your tutorial to learn how to use pplacer. I am stuck at the step where I invoke cmalign it gives the following error:
    Error: Parse failed (sequence file hug_tol.clean.fasta):
    Line 2: illegal character _
    So it complains about the sequence that is starting with an underscore “_”. Just wondered what went wrong.
    Thanks

    • Avatar photo Jeff says:

      Cmalign doesn’t like it if the sequences already have gaps. Just remove the underscores and it should work fine. Good luck!

  9. Ali says:

    Why did you align to 16S_bacteria.cm and not silva.abe.fasta, as you did in your previous post?

    • Avatar photo Jeff says:

      I’ve found that Infernal and the bacteria covariance model produce a much better alignment. It’s also quite fast.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield