One of the most popular primer sets for 16S rRNA gene amplicon analysis right now is the 515F/806R set. One of the advantages of this pair is that it amplifies broadly across the domains Archaea and Bacteria. This reduces by half the amount of work required to characterize prokaryotic community structure, and allows a comparison of the relative (or absolute, if you have counts) abundance of bacteria and archaea. However, paprica and many other analysis tools aren’t designed to simultaneously analyze reads from both domains. Different reference alignments or covariance models, for example, might be required. Thus it’s useful to split an input fasta file into separate bacterial and archaeal files.
We like to use the Infernal tool cmscan for this purpose. First, you’ll need to acquire covariance models for the 16S/18S rRNA genes from all three domains of life. You can find those on the Rfam website, they are also included in paprica/models if you’ve downloaded paprica. Copy the models to new subdirectory in your working directory while combining them into a single file:
mkdir cms cat ../paprica/models/*cm > cms/all_domains.cm cd cms
Now you need to compress and index the covariance models using the cmpress utility provided by Infernal. This takes a while.
cmpress all_domains.cm
Pretty simple. Now you’re ready to do some work. The whole Infernal suite of tools has pretty awesome built-in parallelization, but with only three covariance models in the file you won’t get much out of it. Best to minimize cmscan’s use of cores and instead push lots of files through it at once. This is easily done with the Gnu Parallel command:
ls *.fasta | parallel -u cmscan --cpu 1 --tblout {}.txt cmscan/all_domains.cm {} > /dev/nul
Next comes the secret sauce. The command above produces an easy-to-parse, easy-to-read table with classification stats for each of the covariance models that we searched against. Paprica contains a utility in paprica/utilities/pick_16S_domain.py to parse the table and figure out which model scored best for each read, then make three new fasta files for each of domains Bacteria, Archaea, and Eukarya (the primers will typically pick up a few euks). We’ll parallelize the script just as we did for cmscan.
ls *.fasta | parallel -u python pick_16S_domain_2.py -prefix {} -out {}
Now you have domain-specific files that you can analyze in paprica or your amplicon analysis tool of choice!
These steps are now incorporated (with some improvements) into the paprica-pick_domain.py script and automatically executed if you use paprica-run.sh. Check out the scripts and wiki for more information: https://github.com/bowmanjeffs/paprica
Imma gonna need this one 😉