I’m happy to announce the release of paprica v0.4.0. This release adds a number of new features to our pipeline for evaluating microbial community and metabolic structure. These include:
- NCBI taxonomy information for each point of placement on the reference tree, including internal nodes.
- Inclusion of the domain Eukarya. This was a bit tricky and requires some further explanation.
Eukaryotic genomes are a totally different beast than their archaeal and bacterial counterparts. First and foremost, they are massive. Because of this there aren’t very many completed eukaryotic genomes out there, particularly for single-celled eukaryotes. While a single investigator can now sequence, assemble, and annotate a bacterial or archaeal genome in very little time, eukaryotic genomes still require major efforts by consortia and lots of $$.
One way to get around this scale problem is to focus on eukaryotic transcriptomes instead of genomes. Because much of the eukaryotic genome is noncoding, this greatly reduces sequencing volume. Since there is no such thing as a contiguous transcriptome, this approach also implies that no assembly (beyond open reading frames) will be attempted. The Moore Foundation-funded Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) was an initial effort to use this approach to address the problem of unknown eukaryotic genetic diversity. The MMETSP sequenced transcriptomes from several hundred different strains. The taxonomic breadth of the strains sequenced is pretty good, even if (predictably) the taxonomic resolution is not. Thus, as for archaea, the phylogenetic tree and metabolic inferences should be treated with caution. For eukaryotes there are the additional caveats that 1) not all genes coded in a genome will be represented in the transcriptome, 2) the database contains only strains from the marine environment, and 3) eukaryotic 18S trees are kind of messy. Considerable effort went into making a decent tree, but you’ve been warned.
Because the underlying data is in a different format, not all genome parameters are calculated for the eukaryotes. 18S gene copy number is not determined (and thus community and metabolic structure are not normalized), and the phi parameter, GC content, etc. are also not calculated. However, eukaryotic community structure is evaluated and metabolic structure inferred in the same way as for the domains Bacteria and Archaea:
./paprica-run.sh test.eukarya eukarya
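If you have 16S and 18S reads from the same sample, a small wrapper can call the run script once per domain. The following is only a sketch, not part of paprica: the function name `run_all_domains` and the `PAPRICA_RUN` variable are made up for illustration, and it assumes that the other two domains are invoked with the arguments `bacteria` and `archaea` and that your input files follow the same `<name>.<domain>` pattern as `test.eukarya`.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with paprica): run the pipeline once per domain.
# Assumption: inputs are named <prefix>.<domain>, mirroring "test.eukarya" above,
# and the second argument for the other domains is "bacteria" or "archaea".
run_all_domains() {
    local prefix="$1"
    # PAPRICA_RUN is an illustrative override; by default use the script from this post.
    local runner="${PAPRICA_RUN:-./paprica-run.sh}"
    local domain
    for domain in bacteria archaea eukarya; do
        echo "running paprica on ${prefix}.${domain} (${domain})" >&2
        # Stop at the first domain that fails rather than continuing silently.
        "$runner" "${prefix}.${domain}" "$domain" || return 1
    done
}
```

With the function loaded into your shell, `run_all_domains test` would run the three test sets in turn, assuming all three exist in the current directory.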
As always, you can install paprica v0.4.0 by following the instructions here, or you can use the VirtualBox or Amazon Web Services machine instance.