The Database Dilemma

Microbial ecologists know they have a problem with data archiving, particularly when it comes to sequence data.  I’m not entirely sure why this is the case; in theory it would be pretty straightforward to build a database searchable by the parameters ecologists are interested in.  I suspect that part of the problem is a lack of interest on the part of NSF (the primary source of funding for ecology) in designing and maintaining databases, and the perception that this is duplicative of the (rather sad) efforts of the NIH-funded NCBI in this area.  I don’t want to be too disparaging of NCBI – they have a much harder job than the simple database I’ve outlined above – but the Sequence Read Archive (SRA) and other NCBI repositories are a near-total disaster.  We’re wasting a lot of collective time, money, and effort developing sequence data that can’t be used for future studies because it can’t be easily accessed.

In response to this, a small number of alternative repositories have popped up.  In my experience, however, they don’t offer much of an improvement.  This isn’t meant as a harsh criticism of various peoples’ good efforts; I just don’t think the community’s found a viable model yet.  One example is the VAMPS database operated by the Marine Biological Laboratory, which in addition to having an unfortunate name (I believe it was named before the onset of the angsty-teen vampire thing; try googling “VAMPS” and you’ll see what I mean), tries to do too much.  Rather than organizing data and making it searchable, the designers opted to try and build an online analysis pipeline.  Instead of easy-to-find data you end up with an analytical black box and canned results.

The reason for all this soap-boxing is my experience this week trying to obtain some global marine 16S rRNA gene amplicon data for a project that I’m working on.  The data I had hoped to obtain was used in the 2014 PNAS paper Marine bacteria exhibit a bipolar distribution.  My difficulty in obtaining the data highlights failures at the author, reviewer, journal, and repository levels.  I have great respect for many of the authors on this paper, so again this isn’t meant as a harsh criticism (I’m sure I haven’t made data I’ve published sufficiently accessible either), but rather a lesson in something that we all need to do better at.  I’ll end with the Python-based solution I came up with for getting useful sequence data out of the VAMPS flat-file format.

The 2014 paper used 277 samples that were collected as part of the International Census of Marine Microbes (ICoMM).  There are quite a few things one can say about that particular endeavor but we’ll save it for another post.  The authors included the accession numbers for all these studies in a supplementary file.  This file is a pdf, so not the easiest thing to parse, but at least the text is renderable, so the (7-page) table can be copied into a text file and parsed further.  With a little bit of effort you can get a single-column text file containing the accession numbers for each of these studies.  Great.  Unfortunately a good number of these are wildly incorrect, and most of the remainder point to SRA objects that can’t be recognized by the woefully inadequately documented SRA toolkit.  Here are some examples:

SRX011052 – Enter this accession number into the SRA search bar and it will navigate you to a page for a sample with the helpful name “Generic sample from marine metagenome”.  Never mind for a moment that “metagenome” is an inappropriate name for an amplicon dataset – where is the data?  Click the link for “Run” (with a different accession number: SRR027219) and navigate to a new page with lots of obscure information you don’t care about.  If you’re actually doing this you might, in near desperation, click the “Download” tab in the hope that this gets you somewhere.  Psych!  All that page has is a message that you need to use the SRA toolkit to download this data.  Let’s try the prefetch utility in the toolkit:

>prefetch SRR027219

2016-05-25T15:30:43 prefetch.2.4.2 err: path not found while resolving tree
within virtual file system module - 'SRR027219.sra' cannot be found.

Whatever that means.  There’s a trick to using this utility, which I’ll post as soon as I remember what it is.  The documentation suggests that the above command should work.  The absurdity of needing to rely on this utility in place of FTP usually drives me to use the EMBL-EBI site instead.  This is the Euro equivalent of GenBank; anything submitted to one should be accessible through the other.  In general EMBL-EBI is much better organized and more user-friendly than the SRA.

Entering “SRX011052” into the EMBL-EBI search bar returns some results that also point us to SRR027219.  The page for SRR027219 has such helpful things as an FTP link for direct download as well as some basic info on the sample.  That’s great, but because the authors opted to provide “Experiment” accession numbers for some samples (e.g. SRX011052) there is no way that I can see to generate a script to automate downloads, as we have no way of knowing the run accession numbers (a sketch of what such a script might look like, had run accessions been provided, is below).  The solution is to tediously follow the links to all 277 runs.  I actually started doing this, which led me to my second problem.
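For reference, here is roughly what automating those downloads could look like if the supplementary table had listed run (“SRR”) accessions.  This is a minimal sketch: the runs list is hypothetical, and the ENA FTP path pattern (files organized under the first six characters of the run accession) is an assumption that should be checked against the FTP link shown on each run’s EMBL-EBI page.

from urllib.request import urlretrieve

## hypothetical list of run accessions, e.g. read from a (corrected)
## supplementary table
runs = ['SRR027219']

for run in runs:
    ## assumed ENA FTP layout: vol1/fastq/<first 6 characters>/<run>/<run>.fastq.gz;
    ## verify against the download link on the run's EMBL-EBI page
    url = 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/' + run[:6] + '/' + run + '/' + run + '.fastq.gz'
    print('downloading', url)
    urlretrieve(url, run + '.fastq.gz')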

SRP001129 – At some point in the supplementary table the accession numbers switch from a “SRX” prefix to “SRP”.  Picking one of these at random and entering it into the EMBL-EBI search bar returns a result for a genome sequencing project related to the Human Microbiome Project.  Clearly this isn’t from the ICoMM project.  Roughly a third of the accession numbers reported in the paper aren’t correct!

Clearly I wasn’t going to get the data that I needed through either EMBL-EBI or the SRA.  Knowing that the study had been part of the ICoMM I turned to the MBL VAMPS database, where the ICoMM data is also archived.  Thinking that I was on the home stretch I followed the link to the ICoMM portal from the VAMPS homepage.  The right side column gave me two options for download: tar-zipped sff files or a tab-delimited file.  Not wanting to deal with converting from sff (the native 454 file format) I took a look at the tab-delim option.  The top lines of the file look like this:

project dataset sequence        taxonomy        knt     rank    total_knt       frequency
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGACGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGA  Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  13      family  36172   0.000359394006
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCTAGAGAG      Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAAGGCCGACTGTTAGATGAAGATTCAGCGTAATGAGCTTATCGGATTTTCAGAGAG      Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTAGATGAAGACCAGTGTAACGAACTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAATGTGATGAACTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  2       family  36172   5.5291386e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTAATGAGCTTGTCGGATTTTCAGAGAG       Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05
ICM_ABR_Av6     ABR_0006_2005_01_07     GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTGATGAACTTGTCGGATTTTCAGAGAG   Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II  1       family  36172   2.7645693e-05

Okay, so this is a bizarre way to share sequence data, but at this point I was willing to take it.  Basically the file gives us unique reads, their abundance (knt), and taxonomy according to VAMPS.  What I needed to do was convert this into a separate, non-unique fasta file for each sample.  With a little help from Pandas this wasn’t too bad.  The vamps_icomm_surface.csv file specified below is just a csv version of the pdf table in the PNAS publication, after some tidying up.  The second column of the csv file (i.e. index 1 in Python land) is the sample name in VAMPS (I had to split the first column of the original table on ‘.’ to get this).

import pandas as pd

## get the metadata for surface samples, this was generated
## from the pdf table in the PNAS paper

metadata = pd.read_csv('vamps_icomm_surface.csv', index_col = 1)

## get the seqs, downloaded from https://vamps.mbl.edu/portals/icomm/data_exports/icomm_data.tar.gz
seqs = pd.read_csv('icomm_data.tsv', delimiter = '\t', index_col = 1)

## iterate across each study in metadata, making a temp dataframe
## of reads from that study

for study in metadata.index:
    temp = seqs.loc[study]
    temp_bacteria = temp[temp['taxonomy'].str.contains('Bacteria')]

    ## now iterate across each row of the Bacteria-only dataframe, exporting
    ## each unique sequence knt times (i.e. un-deduplicating) to a fasta file

    with open(study + '.bacteria.fasta', 'w') as output:
        for j, (dataset, row) in enumerate(temp_bacteria.iterrows()):
            sequence = row['sequence']
            for i in range(row['knt']):
                ## the row counter j keeps read names unique within a sample
                name = dataset + '.' + str(j) + '.' + str(i)
                print(name)
                print('>' + name, file = output)
                print(sequence, file = output)

To end I’d like to circle back to the original dual problems of poor database design and erroneous accession numbers.  Although these problems were not directly connected in this case they exacerbated each other, magnifying the overall problem of data (non)reuse.  Some suggestions:

  1.  The SRA and other databases need to aggregate the run accession numbers for each new study.  A couple of clicks should lead me to a download link for every sequence associated with a study, whether that study generated the raw data or only used it.  I don’t believe it is currently possible to do this within the SRA; initiating a new project assumes that you will contribute new raw sequences (correct me if you think otherwise).  This should be easy to implement, and the association of a publication with a project should be rigorously enforced by journals, departments, and funding agencies.
  2. Accession numbers should be automatically checked by journals the way references are.  If an accession number doesn’t point to something downloadable (or at least to a page with a download link), the author should correct it.  There is no excuse for incorrect, incomplete, or non-run accession numbers.  That this was allowed to happen in PNAS – a top-tier journal – is doubly frustrating.  And frankly, if you reviewed this paper you owe me a beer for not spot-checking the refs.
  3. Validating this information is the job of the reviewers.  Almost nobody does this.  Usually when I review a paper containing sequence data I’m the only reviewer who insists that the data be available at the time of publication (I can’t think of a single case where I wasn’t the only reviewer to insist).  Not only is this a necessary check, but part of the job of the reviewer is to spot-check the analysis itself.  Clearly, without data that’s not being done either…

And that’s enough ranting about data availability for one day, back to work…


Roadmap to Ocean Worlds: Polar microbial ecology and the search for totally normal life

Recently Congress recommended that NASA create an Ocean Worlds Exploration Program whose primary goal is “to discover extant life on another world using a mix of Discovery, New Frontiers, and flagship class missions”.  Pretty awesome.  In February I was invited to participate on the science definition team for the Roadmap to Ocean Worlds (ROW) initiative.  The ROW is step one in NASA’s response to the congressional recommendation.  Over the last few months the science definition team has been meeting remotely to hash through some challenging questions around the central themes of identifying ocean worlds, defining their habitability, and understanding the life that might live in or on them.  This week we will have our first and only opportunity to meet in person as a team.  As one of very few biologists in the group, and possibly the only ecologist, I’ll be giving a short talk on polar microbial ecology as it pertains to discovering and understanding life on other worlds.  As a way to organize my thoughts on the subject and prep for the talk I’m experimenting with writing out the main points of the presentation as an article.

I decided to title the talk (and this post) Polar microbial ecology and the search for totally normal life because it reflects an idea that I and others have been advocating for some time: there isn’t anything that special about life in extreme environments (gasp!).  Life functions there in pretty much the same way that it functions everywhere else.  One way to think of this is that polar microbes (and other extremophiles) are uniquely adapted but not unique; they follow the standard rules of ecology and evolution, including how they acquire energy and deal with stress.  The nuances of how they interact with their environment reflect the unique characteristics of that environment, but the microbes are generally kind of conventional.

Act I: Challenges and opportunities for polar microbes

As with microbes in any environment, polar microbes are presented with a range of challenges and opportunities that define how they live.  From a microbial perspective challenges are stressors, or things that cause damage.  Opportunities are energy sources.  Very often these things are the same.  Take sunlight, for example.  This is the single most important source of energy on the planet, but get a little too much of it (or any of it, to be more accurate), and you start doing damage to yourself.  As a result you have to invest some of the acquired energy into offsetting this stress.  We can illustrate this with a very simple conceptual model.

stress_energy

The above figure shows a hypothetical (but plausible) relationship between energy and stress.  Energy is shown both on the x-axis and as the blue 1:1 line.  Stress is the orange line.  In this scenario as energy increases, stress increases logarithmically.  The energy that is available for growth (that which is not dedicated to offsetting the damage caused by stress) is energy – stress, or the difference between the blue and orange lines.  That looks like this:

biomass_stress

The units for biomass are, of course, arbitrary; what matters is the shape of the curve.  Past a critical energy value the amount of biomass an ecosystem can sustain actually decreases as energy increases.  At the point at which stress = energy, no biomass is being accumulated.  It isn’t a coincidence that the shape of the biomass curve is the same as experimentally determined temperature curves for bacterial isolates belonging to different temperature classes (below).  In those curves the energy in the system is increasing as temperature rises.  This enhances enzymatic reactions (producing more biomass) until there is so much energy that the system becomes disordered (enzymes denature).  This process is directly analogous to the one presented here, but takes place at the population level rather than the ecosystem level.


Figure from http://fssp.food.dtu.dk/Help/Histamine/Mp-Mm/mp-mm.htm.  Temperature curves for a psychrophilic (optimized to low temperature) bacteria and a mesophilic (optimized to roughly human body temperature) relative.
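If it helps to see the conceptual model as something concrete, below is a minimal sketch of the energy–stress–biomass relationship in Python.  The quadratic stress term is purely an assumption chosen to reproduce the general shape described above (stress eventually overtaking energy); the real functional form isn’t specified by the model.

import numpy as np
import matplotlib.pyplot as plt

## energy on an arbitrary scale
energy = np.linspace(0, 10, 100)

## assumed stress function, chosen only so that stress overtakes energy
## at the high end of the plot
stress = 0.1 * energy ** 2

## energy available for growth, a loose proxy for sustainable biomass
biomass = energy - stress

plt.plot(energy, energy, label = 'energy (1:1)')
plt.plot(energy, stress, label = 'stress')
plt.plot(energy, biomass, label = 'energy - stress (biomass proxy)')
plt.xlabel('energy (arbitrary units)')
plt.ylabel('arbitrary units')
plt.legend()
plt.show()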

One thing that I would like to make clear up front is that low temperature itself is not a stressor.  Low temperature alone doesn’t kill single-celled organisms; it only impedes their ability to deal with other stress.  Because of this, energy and stress have a very complicated relationship with temperature in cold ecosystems.  For example:

  • As temperature decreases, energy decreases.
  • As energy decreases, stress decreases (remember that in our conceptual model stress is a function of energy).
  • As temperature decreases, access to substrate (and by proxy energy) increases.
  • As temperature decreases, the concentration of inhibitory compounds (a source of stress) increases.

So many of the effects of low temperature are in conflict.  Highly adapted psychrophiles play off these conflicts to establish niches in cold environments.  We can illustrate this idea by focusing on the last two bullet points, which are specific to environments that contain ice.  Impure ice (which virtually all environmental ice is) does not form a solid structure when it freezes.  It forms a porous matrix, with the porosity determined by the concentration of solutes and the temperature.  You can actually view this in 3D using sophisticated X-ray tomography techniques, as shown in this figure from Pringle et al., 2009.


From Pringle et al., 2009.  X-ray tomography images of saline ice at different temperatures.  The gold represents pore spaces, which decrease in size as the temperature drops.

The gold spaces in this figure are pore spaces within the ice.  As the ice gets colder the pore spaces become smaller and the solutes contained in them become more concentrated.  That’s because whatever was dissolved in the starting material (e.g. salt, sugars, small particles) is excluded from the ice as it forms crystals; the excluded material ends up in the pore spaces.  The colder the ice, the smaller the spaces, the more concentrated the solutes.  Bacteria and ice algae that are also excluded into these spaces will experience a much higher concentration of nutrients, potential sources of carbon, etc.  This equates to more energy, which helps them offset the stress of being in a very salty environment.  In very cold, relatively pure ice, solute concentrations can be 1000x those of the starting material.  Neat, right?
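A rough back-of-the-envelope calculation gives a feel for how quickly brine concentrates as ice cools.  The sketch below uses the linear freezing-point depression of seawater (roughly 0.054 °C per ppt of salt) to estimate the salinity of brine in equilibrium with ice; real brine-salinity curves are nonlinear, so treat the numbers as illustrative only.

## approximate salinity (ppt) of brine in equilibrium with ice at temp_c,
## using the linear freezing-point depression of seawater
def equilibrium_brine_salinity(temp_c):
    return -temp_c / 0.054

for temp in [-2, -5, -10, -15]:
    brine = equilibrium_brine_salinity(temp)
    print(temp, 'degC:', round(brine), 'ppt, or about', round(brine / 35, 1), 'x seawater')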

Here’s what decreasing temperature actually means for a bacterial cell trying to make a go of it in a cold environment:

quick_blog_fig

Our earlier plot of biomass depicted cells in the “growth” phase shown here.  Only very early in the plot, when energy was very low, and at the very end, when stress was very high, did biomass approach zero.  That zero point is called the maintenance phase.  Maintenance phase happens when decreasing temperature suppresses the ability to deal with stress to the point that a cell cannot invest enough resources into creating biomass to reproduce.  The cell is investing all its energy in offsetting the damage caused by stress.

At increasingly low temperatures the synthesis of new biomass becomes increasingly less efficient, requiring a proportionally greater expenditure of energy (we say that bacterial growth efficiency decreases).  At the end of the maintenance phase the bacterial cell is still respiring – that is, it is generating energy – but this results in virtually no synthesis or repair of cellular components.  This leads inevitably to cell death once enough cellular damage has accumulated.  It’s sad.

If we lift our heads up out of the thermodynamic weeds for a moment we can consider the implications of all this for finding life on another cold world (all the good ones are cold):

quick_blog_fig2

On the assumption that everyone’s asleep by this point in the presentation I’m trying to be funny, but the point is serious.  You hear a lot of silly (my bias) talk in the astrobiology community about life at the extremes; how low can it go temperature-wise, etc.  Who cares?  I’m not particularly interested in finding a maintenance-state microbial community, nor do we stand a very good chance of detecting one even if it was right under our (lander’s) nose.  The important contribution to make at this point is to determine where life is actively growing on the relevant ocean worlds.  Then we can try to figure out how to reach those places (no mean task!).  The point of this whole exercise is to find extant life, after all…

Act II: Where are polar microbes distributed and why?

I can think of 8 polar environments that are particularly relevant to ocean worlds:

  • Supraglacial environment (glacier surface)
  • Interstitial glacier environment (glacier interior)
  • Subglacial environment (where the glacier meets rock)
  • Sea ice surface (top’o the sea ice)
  • Interstitial sea ice (sea ice interior)
  • Sea ice-seawater interface (where the sea ice meets the seawater)
  • Water column below ice (the ocean! or a lake)
  • The sediment-water interface (where said lake/ocean meets mud)

Each of these environments has a range of stress, energy, and temperatures, and this to a large extent defines their ability to support biomass.  For the purpose of this discussion I’ll report biomass in units of bacteria ml⁻¹.  To give some sense of what this means consider that what the waste-treatment profession politely calls “activated sludge” might have around 10⁸ bacteria ml⁻¹.  Bottled water might have 10³ bacteria ml⁻¹.  Standard ocean water ranges between 10⁴ bacteria ml⁻¹ and 10⁶ bacteria ml⁻¹ at the very highest end.  One further note on biomass… our conceptual model doesn’t account for grazing, or any other trophic transfer of biomass, because that biomass is still in the biological system.  Bacteria ml⁻¹ doesn’t reflect this, because bacteria are consumed by other things.  So the values I give for the ecosystems below are at best a loose proxy for the capacity of each ecosystem to support biomass.


From left to right, top to bottom: supraglacial environment (http://www.swisseduc.ch), interstitial glacier environment (Price, 2000), subglacier environment – arguably a subglacial lake, but a very shallow one with interface characteristics (www.wissard.org), sea ice surface (hey, that’s me!), sea ice interior (Krembs et al., 2002), sea ice-seawater interface (I.A. Melnikov), water below ice (from our last season at Palmer Station), sediment-water interface (www.polartec.com).

Let’s consider the sea ice-seawater interface and the sea ice surface in a little more detail.  The sea ice-seawater interface is interesting because it has, among all polar microbial environments, probably the greatest capacity to support biomass (we’re talking about summertime sea ice, of course).  It is located at the optimum balance point between energy and stress, as shown in the figure below.  Despite the fact that the sea ice surface has relatively high biomass, it is located at the extreme right side of the plot.  The trick is that biomass has accumulated at the sea ice surface as a result of abiotic transport, not in situ growth.  I explored this pretty extensively back in the very early days of my PhD (see here and here).

polar_microbes

Despite the fact that the sea ice surface doesn’t function as a microbial habitat, the concentration of biomass there might be relevant to our goal of finding extant life on another world.  The accumulation of bacteria at the ice surface is largely the result of bacteria being passively transported with salt.  Recall that saline ice is a porous matrix containing a liquid brine.  All of the bacteria in the ice are also contained in the liquid brine; as the ice cools the pore spaces get smaller, forcing some of the brine and bacteria to the ice surface.  Porous ice in, say, the Europan ice shell would act similarly.  Salt is easier to identify than life (in fact we know that there are large salt deposits on the surface of Europa), so we can target salty areas for deeper search.  In astrobiology we like to “follow things” (because we get lost easily?); “follow the water” and “follow the energy” are often cited and somewhat useful axioms.  So here we have a “follow the salt” situation.

Act III: Can we explore polar microbial ecology in life detection?

Act II ended on a life detection note, so let’s follow this through and consider other details of polar microbial ecology that might give us clues about what to look for if we want to identify life.  Getting ecology to be part of this discussion (life detection) is non-trivial.  Despite the fact that the whole field of astrobiology is really an elaborate re-visioning of classic ecology (who lives where and why), there isn’t a whole lot of interest in studying how organisms interact with their environment.  This simply reflects the field’s disciplinary bias; the most active communities within astrobiology are probably the astronomers and planetary geologists.  If the most active community were ecologists we’d probably be ignoring all kinds of important stuff about how planets are formed, etc.  To illustrate the utility of ecology for life detection here are two examples of ecological principles that might lead to life detection techniques.

Case 1 – Biological alteration of the microstructure of ice


From Krembs et al., 2002.  This sea ice diatom has filled the pore space with EPS as a buffer against high salinity and as a cryoprotectant.

Earlier I put forward that decreasing temperatures can lead to heightened stress because life in ice is exposed to higher concentrations of damaging substances at lower temperatures.  The most obvious example is salt.  In very cold sea ice the pore space salinity approaches 200 ppt – that’s roughly 6 times the salinity of seawater!  Maintaining adequate internal water is a huge challenge for a cell under these conditions.  One of the mechanisms for dealing with this is to produce copious amounts of exopolysaccharides (EPS), a hydrated gel (essentially mucus) containing polysaccharides, proteins, and short peptides.  EPS buffers the environment around a cell, raising the activity of water and, in some cases, interacting with the ice.  This produces a highly modified internal ice structure, as shown in the images below.  This alteration could be a useful way of identifying ice on an ocean world that has been modified by a biological community.


From Krembs et al., 2011.

Case 2: Motility

Motility has been put forward before as an unambiguous signature of life, but the idea hasn’t really gained a lot of traction.  Clearly one needs to be cautious with this – Brownian motion can look a lot like motility – but I can’t think of anything else that life does that is as easily distinguishable from abiotic processes.  One additional challenge, however, is that not all life is motile.  Plants aren’t motile, at least most of them over the timescales that we care to stare at them for.  Not all microbes are motile either, but I would argue that those that aren’t are non-motile only because others are.


From Stocker, 2012.

Consider the figure at right, which is a cartoon of two modes of bacterial life in the ocean.  One mode is motile: cells use flagella to follow chemical gradients (we call this chemotaxis) and optimize their location with respect to phytoplankton, their source of carbon.  The second mode is much smaller and in the background; small-bodied non-motile cells that live on the diffuse carbon that they opportunistically encounter.  This works because that would be an inefficient niche for motile bacteria to exploit.  In the absence of motility, however, chemical gradients constitute a very strong selective pressure to evolve motility.  We can see evidence of this in the convergent evolution of flagellar motility (and other forms of motility) in all three domains of life.  Although they may share a common chemotaxis sensory mechanism, the Bacteria, Archaea, and Eukarya all seem to have evolved flagellar motility independently.  This means that it’s probably a pretty good feature to have, and is likely to be shared by microbes on other ocean worlds.
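As a toy illustration of why motility is distinguishable from Brownian motion, the sketch below compares the net displacement of a particle taking random diffusive steps with that of a run-and-tumble swimmer.  The parameters are invented purely for illustration; the point is only that persistent runs carry a cell much farther than diffusion alone over the same number of steps, which is the kind of signal one could look for.

import numpy as np

rng = np.random.default_rng(42)

def brownian(n_steps, step = 0.05):
    ## a new random direction every step, i.e. diffusion-like motion
    angles = rng.uniform(0, 2 * np.pi, n_steps)
    steps = step * np.column_stack((np.cos(angles), np.sin(angles)))
    return steps.cumsum(axis = 0)

def run_and_tumble(n_steps, speed = 0.5, tumble_prob = 0.1):
    ## persistent runs punctuated by occasional random reorientations (tumbles)
    angle = rng.uniform(0, 2 * np.pi)
    path = np.zeros((n_steps, 2))
    for i in range(1, n_steps):
        if rng.random() < tumble_prob:
            angle = rng.uniform(0, 2 * np.pi)
        path[i] = path[i - 1] + speed * np.array([np.cos(angle), np.sin(angle)])
    return path

for label, track in [('brownian', brownian(1000)), ('run and tumble', run_and_tumble(1000))]:
    print(label, 'net displacement after 1000 steps:', round(np.linalg.norm(track[-1]), 1))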

That was quite a lot, so to summarize:

  • Microbial communities are oriented along gradients of energy and stress.  At some optimal point along that gradient a maximum amount of biomass can be supported by surplus energy that is not being used to deal with stress (How much energy do you spend on stress?  Think of how much more biomass you could have!).
  • The relationships between energy, stress, and temperature are complicated, but Earth life generally works at T > -12 °C.  This estimate is probably a little lower than the reality, because laboratory observations of growth at that temperature don’t accurately reflect environmental stress.
  • Life is strongly biased towards surfaces and interfaces; these may provide enhanced opportunities for life detection (follow the salt!).
  • The specific ecology of cold organisms can provide some further insights into life detection strategies.  For example, motility might be an under-appreciated signature of life.

 


Phylogenetic placement re-re-visited

I use phylogenetic placement, namely the program pplacer, in a lot of my publications.  It is also a core part of the paprica metabolic inference pipeline.  As a result I field a lot of questions from people trying to integrate pplacer into their own workflows.  Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic, the three programs you need to work with to do phylogenetic placement from start to finish (see also EPA), there is still a steep learning curve for new users.  In the hope of bringing the angle of that curve down a notch or two, and updating my previous posts on the subject (here and here), here is a complete, start-to-finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al.  To follow along with the tutorial start by downloading the sequences here.

You can use any number of alignment and tree building programs to create a reference tree for phylogenetic placement.  I strongly recommend using RAxML and Infernal.  After a lot of experimentation this combination seems to produce the most correct topologies and best supported trees.  You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level, let alone life-level, analyses, but nothing in life is perfect.  This is doubly the case for this tutorial… I haven’t come across a covariance model for the 16S/18S gene across domains (if you have please post a comment!), so we will use the covariance model for the domain Bacteria.  You can find that at the Rfam database here.  For the purpose of this tutorial the cm has been renamed “16S_bacteria.cm”.  While you’re installing software I also recommend the excellent utility Seqmagick.

Although this tutorial was developed specifically for the 16S rRNA gene, the method is easily adapted to any DNA or protein sequence.  Simply swap out the cmalign command in Infernal for the hmmalign command in HMMER3.0, and download (or create) an appropriate HMM.  You may also need to swap out the model used by RAxML.

The workflow will follow these steps:

  1. Create an alignment of the reference sequences with Infernal
  2. Create a phylogenetic tree of the alignment
  3. Create a reference package from the alignment, tree, and stats file
  4. Proceed with the phylogenetic placement of your query reads

Create an alignment of the reference sequences

The very first thing that you need to do is clean your sequence names of any wonky punctuation.  This is something that trips up almost everyone.  You should really have no punctuation in the names beyond “_”, and no spaces!

tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < hug_tol.fasta > hug_tol.clean.fasta

Next create an alignment from the cleaned file.  I always like to degap first, although it shouldn’t matter.

## Degap
seqmagick mogrify --ungap hug_tol.clean.fasta

## Align
cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam 16S_bacteria.cm hug_tol.clean.fasta

## Convert to fasta format
seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta

If any sequences are identical after alignment this will mess things up (you should never build a tree from an alignment containing duplicate sequences), so make the alignment non-redundant.

## Deduplicate sequences
seqmagick mogrify --deduplicate-sequences hug_tol.clean.align.fasta

Build the reference tree

At this point you should have a nice clean DNA alignment in the fasta format.  Now feed it to RAxML to build a tree.  Depending on the size of the alignment this can take a little bit.  I know you’ve read the entire RAxML manual, so of course you are already aware that adding additional cpus won’t help…

## Build tree without confidence scores, just to get stats for Taxtastic
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345

I like having a sensibly rooted tree; it’s just more pleasing to look at.  You can do this manually, or you can have RAxML try to root the tree for you.

## Root the tree
raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre

Okay, now comes the tricky bit.  Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference package won’t be able to read the RAxML stats file if it includes confidence values.  The workaround is to add the confidence scores to the already generated (rooted) tree, so that you have versions of the tree both with and without scores.  You will feed the scored tree to Taxtastic with the stats file from the unscored tree we already generated.

## Generate confidence scores for tree
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta

Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.

taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre

Align the query reads

At this point you have the reference package and you can proceed with analyzing some query reads!  The first step is to align the query reads after combining them with the de-duplicated reference alignment*.

## Clean the names
tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < query.fasta > query.clean.fasta

## Concatenate the query reads and reference alignment
cat hug_tol.clean.align.fasta query.clean.fasta > query.hug_tol.clean.fasta

## Remove gaps
seqmagick mogrify --ungap query.hug_tol.clean.fasta

## Align
cmalign --dna -o query.hug_tol.clean.align.sto --outformat Pfam 16S_bacteria.cm query.hug_tol.clean.fasta

## Convert to fasta
seqmagick convert query.hug_tol.clean.align.sto query.hug_tol.clean.align.fasta

Phylogenetic placement

Now we’re on the home stretch; we can execute the phylogenetic placement itself!  The flags are important here, so it’s worth checking the pplacer documentation to ensure that your goals are consistent with mine (get a job, publish some papers?).  You can probably accept most of the flags for the previous commands as is.

pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta

At this point you have a file named query.hug_tol.clean.align.jplace.  You will need to use guppy to convert this json-format file to information that is readable by humans.  The two most useful guppy commands (in my experience) for a basic look at your data are:

## Generate an easily parsed csv file of placements, with only a single placement reported for each
## query read.
guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace

## Generate a phyloxml tree with edges fattened according to the number of placements.
guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace
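Once you have the csv file, a quick look in Pandas tells you how your reads are distributed across the reference tree.  This is a minimal sketch; the column names used here (‘edge_num’ and ‘post_prob’) are assumptions based on typical guppy to_csv output and may differ between pplacer versions, so check the header of your csv first.

import pandas as pd

## read the placements produced by guppy to_csv
placements = pd.read_csv('query.hug_tol.clean.align.csv')

## because of --point-mass there is a single placement per query read, so a
## simple tally of edge_num gives the number of reads placed on each edge
## of the reference tree
print(placements['edge_num'].value_counts().head())

## the mean posterior probability gives a rough sense of placement confidence
print(placements['post_prob'].mean())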

*A far more elegant solution would be to use esl-alimerge included with Infernal to merge the query and reference alignment, after aligning the query reads against 16S_bacteria.cm.  The problem with this is that esl-alimerge needs the original sto file produced by cmalign, and that file contains duplicate sequences not used to build the reference tree.  Easy enough to de-duplicate this file, but the sto format produced by Seqmagick is not compatible with esl-alimerge 🙁


Another victim of science funding

Or rather the lack thereof.  I was very disappointed to receive an email yesterday that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model.  The email is posted below in its entirety (to assist in the request to spread the word).  With this move BioCyc cements the pattern established several years ago by the Kyoto Encyclopedia of Genes and Genomes (KEGG).  I don’t begrudge the architects of these databases for moving behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time and resource intensive, and skilled curators are expensive.  I do think, however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest.  While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is in their court.  In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.

As noted in the letter below, one of the challenges faced by database designers and curators is the tradition of peer review.  At its best the peer review process is an effective arbiter of research spending.  I think that the peer review process is at its best only under certain funding regimes, however, and suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review.  This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role played in dozens or hundreds or thousands of studies across the community over many years.  Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and not shy away from some strategic thinking every now and again.

My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes.  Like many academic labs around the country the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (à la KEGG or BioCyc – both were thankfully free back in those days) would have been out of the question.  Without these tools I simply would have had nowhere to start my exploration.  I fear that the subscription model creates intellectual barriers, potentially partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).

Viva community science!

************************

Dear Colleague,

I am writing to request your support as we begin a new chapter in the development of BioCyc.

In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens.  Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies.  Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.

Our Goal

Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.

Our work on EcoCyc has demonstrated the way forward.  EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications.  EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model.  EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.

Our goal is to develop similar high-quality databases for other organisms.  BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly.  Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information.  The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.

EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.

In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape.  Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace.  That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases.  Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.

Why Do We Seek Financial Support from the Scientific Community?

The EcoCyc project has been fortunate to receive government funding for its development since 1992.  Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.

Peter Karp has been advocating that the government fund similar efforts for other important microbes for the past twenty years, such as for pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development.  He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases [1].  Funding for BioCyc was cut by 27% at our last renewal whereas we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.

Put another way: the Haemophilus influenzae genome was sequenced in 1995.  Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen.  Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year.  No curated database exists for the important gram-positive model organism Bacillus subtilis.  How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?

How it Will Work and How You Can Support BioCyc

The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one.  We are now turning to a similar model to support the curation and operation of BioCyc.  We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution.  We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.

We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world.  We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.

BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model.  Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported [2] by users.

Phoenix Bioinformatics  will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.

We plan to go slow with this transition to give our users time to adapt.  We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.

Access to the EcoCyc and MetaCyc databases will remain free for now.

Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals.  One subscription will grant access to all of BioCyc.  To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.

Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them.  We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up.  We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.

Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.

Peter Karp

Director, SRI Bioinformatics Research Group

[1] http://www.nature.com/nature/journal/v527/n7576_supp/full/527S16a.html

[2] http://database.oxfordjournals.org/content/2016/baw018.long


Tutorial: Annotating metagenomes with paprica-mg

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Starting with version 3.0.0 paprica contains a metagenomic annotation module.  This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database.  The module produces as output:

  1.  Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
  2. A tally of the occurrence of each EC number in the sample, with some useful supporting information.
  3. Optionally, the metabolic pathways likely to be present within the sample.

In addition to the normal paprica-run.sh dependencies paprica-mg requires DIAMOND Blastx.  Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH.  If you want to predict metabolic pathways on your metagenome you will need to also download the pathway-tools software.  See the notes here.

Obtain the paprica-mg database

There are two ways to obtain the paprica-mg database.  You can obtain a pre-made version of the database by downloading the files paprica-mg.dmnd and paprica-mg.ec.csv.gz (large!) to the ref_genome_database directory in the paprica directory.  Be sure to gunzip paprica-mg.ec.csv.gz before continuing.

## Navigate to wherever you installed the paprica database
cd ~/paprica/ref_genome_database

## Download the tarball the contains the paprica-mgt database
wget https://www.polarmicrobes.org/extras/paprica-mgt.database.tgz

## Untar
tar -xzvf paprica-mgt.database.tgz

Alternatively, if you wish to build the paprica-mg database from scratch, perhaps because you’ve customized that database or are building it more frequently than the release cycle, you will need to first build the regular paprica database.  Then build the paprica-mgt database as such:

paprica-mgt_build.py -ref_dir ref_genome_database

If you’ve set paprica up in the standard way you can execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself.  Likewise you don’t need to be in the paprica directory to execute the paprica-mg_run.py script.

Annotate a metagenome

Once you’ve downloaded or  built the database you can run your analysis.  It is worth spending a little time with the DIAMOND manual and considering the parameters of your system.  To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):

## Download a test metagenome
wget https://www.polarmicrobes.org/extras/ERR318619_1.qc.fasta.gz

## Execute paprica-mg for EC annotation only
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F

This will produce the following output:

test.annotation.csv: The number of hits in the metagenome, by EC number.  See the paprica manual for a complete explanation of columns.

test.paprica-mg.nr.daa: The DIAMOND format results file.  Only one hit per read is reported.

test.paprica-mg.nr.txt: A text file of the DIAMOND results.  Only one hit per read is reported.
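If you want a quick look at the annotation results, something like the Pandas snippet below works.  Treat this as a sketch: the column holding the hit tally is called ‘n_hits’ here purely for illustration, so check the header of your own test.annotation.csv (and the paprica manual) and substitute the real column names.

import pandas as pd

## read the per-EC-number tally produced by paprica-mg
annotations = pd.read_csv('test.annotation.csv', index_col = 0)

## see what columns are actually present before going further
print(annotations.columns)

## assuming a hit-count column named 'n_hits' (hypothetical), list the ten
## most commonly hit enzymes in the metagenome
print(annotations.sort_values('n_hits', ascending = False).head(10))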

Predicting pathways on a metagenome is very time intensive and it isn’t clear what the “correct” way is to do this.  I’ve tried to balance speed with accuracy in paprica-mg.  If you execute with -pathways T, DIAMOND is executed twice; once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction.  On that search DIAMOND reports as many hits for each read as there are genomes in the paprica database.  Of course most reads will have far fewer (if any) hits.  The reason for this approach is to try and reconstruct as best as possible the actual genomes that are present.  For example, let’s say that a given read has a significant hit to an enzyme in genome A and genome B.  When aggregating information for pathway-tools the enzyme in genome A and genome B will be presented to pathway-tools in separate Genbank files representing separate genetic elements.  Because a missing enzyme in either genome A or genome B could cause a negative prediction for the pathway, we want the best chance of capturing the whole pathway.  So a second enzyme, critical to the prediction of that pathway, might get predicted for only genome A or genome B.  The idea is that the incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once) over-prediction is minimized.  To predict pathways during annotation:

## execute paprica-mg for EC annotation and pathway prediction
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local

In addition to the files already mentioned, you will see:

test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.

test.pathways.txt: A simple text file of all the pathways that were predicted.

test.paprica-mg.txt: A very large text file of all hits for each read.  You probably want to delete this right away to save space.

test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read.  You probably want to delete this right away to save space.

testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports.  It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.

 


Antarctic wind-driven bacterial dispersal paper published


Frost flowers over young sea ice in the central Arctic Ocean. Photo Mattias Wietz.

I’m happy to report that one of the appendices in my dissertation was just published in the journal Polar Biology.  The paper, titled Wind-driven distribution of bacteria in coastal Antarctica: evidence from the Ross Sea region, was a long time in coming.  I conceived of the idea back in 2010 when it looked like my dissertation would focus on the microbial ecology of frost flowers; delicate, highly saline, and microbially enriched structures found on the surface of newly formed sea ice.  Because marine bacteria are concentrated in frost flowers we wondered whether they might serve as source points for microbial dispersal.  This isn’t as far-fetched as it might seem; bacteria are injected into the atmosphere through a variety of physical processes, from wind lofting to bubble-bursting, and frost flowers have been implicated as the source of wind deposited sea salts in glaciers far from the coast.

At the time we’d been struggling to reliably sample frost flowers at our field site near Barrow, Alaska.  Frost flowers form readily there throughout the winter, but extremely difficult sea ice conditions make it hard to access the formation sites.  We knew that there were more accessible formation sites in the coastal Antarctic, so we initiated a one year pilot project to sample frost flowers from McMurdo Sound.  By comparing the bacterial communities in frost flowers, seawater, sea ice, terrestrial snow, and glaciers, we hoped to show that frost flowers were a plausible source of marine bacteria and marine genetic material to the terrestrial environment.  Because the coastal Antarctic contains many relic marine environments, such as the lakes of the Dry Valleys, the wind-driven transport of bacteria from frost flowers and other marine sources could be important for continued connectivity between relic and extant marine environments.

Frost flowers are readily accessible in McMurdo Sound throughout the winter; however, this does not mean that one can simply head out and sample them.  While the ice conditions are far more permissive than at Barrow, Alaska, the bureaucracy is also far more formidable.  The can-do attitude of our Inupiat guides in Barrow (who perceive every far-out field plan as a personal challenge) was replaced with the inevitable can’t-do attitude at McMurdo (this was 2011, under the Raytheon Antarctic Support Contract, and does not reflect on the current Lockheed Antarctic Support Contract, not to suggest that this attitude doesn’t persist).  Arriving in late August we were initially informed that our plan was much too risky without helicopter support, and that nothing could be done until mid-October when the helicopters began flying (we were scheduled to depart late October).  Pushing for a field plan that relied on ground transport ensnared us in various catch-22’s, such as (paraphrased from an actual conversation):

ASC representative: You can’t take a tracked vehicle to the ice edge, they’re too slow.

Me: Can we take a snowmobile to the ice edge?  That would be faster.  We do long mid-winter trips in the Arctic and it works out fine.

ASC representative: No, because you have to wear a helmet, and the helmets give you frostbite.  So you can only use a snowmobile when it’s warm out.

Ultimately we did access the ice edge by vehicle several times before the helicopters started flying, but the samples reported in this publication all came from a furious two week period in late October.  What we found really surprised us.

Sampling frost flowers is as easy as scraping them from the ice surface with a clean shovel into a sterile plastic bin.

Sampling frost flowers on a warm day in September, 2011, in McMurdo Sound.

Pulling very short ice cores from young sea ice after sampling frost flowers. You can see one of the short cores in the lower right of the image.

Sampling some of the frost flowers used in this study over much thinner ice north of Ross Island in October, 2011.

Rotor wash on a smooth ice surface can lead to lots of scattered gear. On any tenuous surface the helicopter unloads us "hot", without shutting down, meaning that the first few minutes at a sampling site are a loud, confusing scramble to get everything unloaded and protected before the helicopter takes off. Here we sort ourselves out after the helicopter departs. Photo: Shelly Carpenter.

Sampling “blue ice” on Taylor Glacier a few days earlier.  Comparing various terrestrial ice environments with the marine samples gave us some interesting insights on how bacteria are dispersed by strong winter winds in the coastal Antarctic.

There is ample evidence for the wind-driven transport of bacteria in this region but, contrary to our hypothesis, most of that material is coming from the terrestrial environment.  The major transportees were a freshwater cyanobacterium from the genus Pseudanabaena and a set of sulfur-oxidizing Gammaproteobacteria (GSO).  The cyanobacterium was pretty easy to understand; it forms mats in a number of freshwater lakes and meltponds in the region.  In the winter these freeze, and since snow cover is low, ice and microbial mats are ablated by strong winter winds.  Little pieces of mat are efficiently scattered all over, including onto the sea ice surface.

Easily dispersed: desiccated microbial mat in Lake Chad, 2005 (photo by A. Chiuchiolo), taken from http://www.montana.edu/priscu/antarcticimages/dry_valley_lakes.html.

The GSO threw us for more of a loop; the most parsimonious explanation for their occurrence in frost flowers is that they came from hydrothermal features on nearby Mt. Erebus.  We did some nice analysis with wind vectors in the region and while you don’t get a lot of wind (integrated over time) to move material from Mt. Erebus to our sample sites, you do get some occasional very strong storms.

Wind velocity and magnitude for the study region in October, 2011. Taken from Bowman and Deming, 2016.


What all this means is that, consistent with other recent findings, there is high regional dispersal of microbes around the coastal Antarctic.  While I’m sure there are some endemic microbes occupying particularly unique niches, in general I expect microbes found in one part of the coastal Antarctic to be present in a similar environment in another part of the coastal Antarctic.  There are, however, quite a few ways to interpret this.  Bacteria and Archaea can evolve very fast, so the genome of a clonal population of (for example) wind-deposited Pseudanabaena newly colonizing a melt pond can diverge pretty quickly from the genome of the parent population.  This has a couple of implications.  First, it means that the coastal Antarctic, with all its complex topography yet high degree of microbial connectivity, is an excellent place to explore the dynamics of microbial adaptation and evolution, particularly if we can put constraints on the colonization timeline for a given site (non-trivial).  Second, it raises some questions about the propriety of commercially relevant microbes obtained from the continent.  The commercialization of the continent is probably inevitable (I hope it is not); perhaps the potential ubiquity of Antarctic microbes will provide some defense against the monopolization of useful strains, enzymes, and genes.


Tutorial: Building the paprica database

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Building the paprica database provides maximum flexibility but involves more moving parts and resources than using the provided database.  Basic instructions for using the paprica-build.sh script are provided in the manual; this tutorial is intended to provide an even more detailed step-by-step guide.

Requirements

While a laptop running Linux, Windows with VirtualBox, or OSX is perfectly adequate for analysis with paprica, you’ll need something a little beefier for building the database (unless you’re really patient).  A high performance cluster is overkill; I build the provided database on a basic 12 core Linux workstation with 32 Gb RAM (< $5k).  Something in this ballpark should work fine; of course more cores will get the job done faster (but keep an eye on memory usage).

Once you’ve got the hardware requirements sorted out you need to download the dependencies.  I recommend first following all the instructions for the paprica-run.sh script, then installing RAxML, pathway-tools, and taxtastic.  The rest of this tutorial assumes you’ve done just that, including running the test.bacteria.fasta file against the bacteria database:

./paprica-run.sh test.bacteria bacteria

Install Remaining Dependencies

In addition to all the dependencies required by paprica-run.sh, you need pathway-tools and RAxML.  These are very mainstream programs, but that doesn’t necessarily mean installation is easy.  In particular, pathway-tools requires that you request a license (free for academic users).  This takes about 24 hours, after which you’ll receive a link to download the installer.  Regardless of whether you’re sitting at the workstation or accessing it via SSH, a GUI will pop up and guide you through the installation.  In general you can accept the defaults; however, the GUI will ask you where pathway-tools should create the ptools-local directory.  This is where the program will create the pathway-genome databases that describe (among other things) the metabolic pathways in each genome.  By the time you are done creating the database this directory will be > 100 Gb, so pick a location with plenty of space!  This might not be your home directory (the default location).  For example, on my system my home directory is housed on a small SSD.  To keep the home directory from becoming bloated I opted to locate ptools-local on a separate drive.

You will receive a number of download options from the pathway-tools development team.  I recommend that you conduct only the basic installation of pathway-tools (this is the EcoCyc and MetaCyc option), and do not download and install additional PGDBs.  There’s nothing wrong with installing these additional, well-curated PGDBs other than the extra space and time they require, but they make the installation ponderous.  You can always add them later if you want to become a metabolic modeling rock star.

Once you’ve installed pathway-tools you should be sure to add the program to your PATH, following standard methods.  Once you’ve done this re-source .profile and type pathway-tools in a bash shell.  The GUI should open.
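
For example, if the pathway-tools executable ended up in /home/you/pathway-tools (a hypothetical location; use wherever the installer actually put it on your system), the addition to .profile and the test would look something like this:

## add this line to the end of .profile, then re-source and test
export PATH=/home/you/pathway-tools:${PATH}
source ~/.profile
pathway-tools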

RAxML is the one piece of software used by paprica that requires compilation.  Fortunately the RAxML folks generally build good software, so compiling isn’t likely to be a problem.  RAxML comes in several flavors, and paprica is a bit particular about the version of RAxML it expects to find.  RAxML gets called in two scripts: paprica-get_ref.py and paprica-place_it.py.  These scripts call “raxmlHPC-PTHREADS-AVX2”, so you need to make sure you build the parallel threaded AVX2 version, i.e.:

make -f Makefile.AVX2.PTHREADS.gcc

If you can’t build that version due to hardware limitations (a very old workstation or some such) you’ll need to modify those scripts accordingly, or name the working RAxML command raxmlHPC-PTHREADS-AVX2 (I cringe at this suggestion, but it is the simplest solution).  Get in contact with me if you have issues.  As with all dependencies, after you build raxmlHPC-PTHREADS-AVX2 you need to add the RAxML directory to your PATH, re-source .profile, and test the installation by typing raxmlHPC-PTHREADS-AVX2 in a bash shell.  If RAxML yells at you with a warning it installed correctly and you’re good to go.
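
As a concrete sketch, assuming you cloned and built RAxML in /home/you/standard-RAxML (a hypothetical location; use wherever you actually built it):

## add this line to .profile, then re-source and test
export PATH=/home/you/standard-RAxML:${PATH}
source ~/.profile
raxmlHPC-PTHREADS-AVX2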

Test paprica-build.sh

There are a lot of moving parts in paprica-build.sh.  Because of this, and because of the amount of time certain steps take to complete, troubleshooting can be a little frustrating.  To make things easier you can download a fake ref_genome_database directory here with 11 genomes pre-loaded (10 in the bacteria/refseq and 1 in user/bacteria).  Download, remove the old ref_genome_database directory that came with paprica, and untar the new one:

rm -r ref_genome_database
wget https://www.polarmicrobes.org/extras/ref_genome_database.tgz
tar -xzvf ref_genome_database.tgz

At this point in time it is absolutely essential that you open paprica-build.sh and switch the -download flag in the paprica-make_ref.py line from “T” to “test”.  If you don’t do this you will initiate a fresh download of all the completed genomes in Genbank.  Make sure you switch this flag back to T when you are done testing and ready to actually build the database.
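
If you’d rather not open the file in an editor, something like the following should work; this assumes the flag appears in paprica-build.sh literally as “-download T”, so check with grep first in case the spacing differs:

grep download paprica-build.sh
sed -i 's/-download T/-download test/' paprica-build.sh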

Once you’ve downloaded the test ref_genome_database directory and switched the flag simply execute the script:

./paprica-build.sh bacteria

You’ll see lots of output flash by on the screen.  Keep an eye out for error messages that could indicate something amiss.  If something goes wrong the error messages should be quite obvious as the later steps will fail completely.  I have noticed that for reasons which are not yet clear, when building the full database the script sometimes hangs after PGDB creation.  If that happens use control-c to exit the bash script, and re-execute the paprica-build_core_genomes.py script.  You can re-execute the whole bash script if you like, but it takes some time and isn’t necessary.  If you start seeing messages like:

collecting data for internal node 13, 1 of 9
collecting data for internal node 19, 2 of 9
collecting data for internal node 16, 3 of 9
collecting data for internal node 6, 4 of 9
collecting data for internal node 2, 5 of 9
collecting data for internal node 5, 6 of 9
collecting data for internal node 15, 7 of 9
collecting data for internal node 9, 8 of 9
collecting data for internal node 14, 9 of 9

…you can relax, you’re at the end and nothing went amiss.  This small collection of genomes includes the CCG for the reads in test.bacteria.fasta, so you can even test your new “database” with paprica-run.sh:

./paprica-run.sh test.bacteria bacteria

Add Custom Draft Genomes (Optional)

V0.3.0 of paprica introduced the ability to add custom draft genomes.  These genomes add additional CCGs to your reference tree and increase the accuracy of the metabolic inference.  Because they are not necessarily complete however, they are not used to calculate genome parameters such as the number of 16S rRNA genes or the length of the genome.  Placements to these edges should produce NA values in the edge_data.csv file produced by paprica-run.sh.  If you don’t want to add custom draft genomes you don’t need to do anything at this point.  If you want to add draft genomes you should create a unique directory for each, based on accession number, in ref_genome_database/user/bacteria or ref_genome_database/user/archaea (as appropriate).  For example, to add several draft Sulfitobacter genomes my directory structure looks like this:

me@computer:/home/me/paprica/ref_genome_database/user/bacteria$ ls
draft.combined_16S.fasta  GCF_000152645.1  GCF_000622325.1  GCF_000622365.1  GCF_000622405.1  GCF_000647675.1  GCF_000735125.1
GCF_000152605.1           GCF_000620505.1  GCF_000622345.1  GCF_000622385.1  GCF_000622425.1  GCF_000712315.1

Inside each of these directories are the .fna file for the draft genome and the genomic .gbff file (use that extension, not .gbk).  So like this:

me@computer:/home/me/paprica/ref_genome_database/user/bacteria/GCF_000152645.1$ ls
GCF_000152645.1_ASM15264v1_genomic.fna  GCF_000152645.1_ASM15264v1_genomic.gbff

When you build the database paprica-make_ref.py will automatically look in the user directory and attempt to use anything it finds there.  Your draft genome must have a 16S rRNA gene, which not all of them do.  If it does not the build will continue without your draft genome.
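
For example, to stage a single hypothetical draft genome with accession GCF_012345678.1 (substitute your real accession and file names), the steps would look something like:

cd ~/paprica/ref_genome_database/user/bacteria
mkdir GCF_012345678.1
cp /path/to/GCF_012345678.1_genomic.fna GCF_012345678.1/
cp /path/to/GCF_012345678.1_genomic.gbff GCF_012345678.1/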

Add Custom EC Numbers

If EC numbers that you’re pretty sure should be present in a genome are missing from its annotation (annotations aren’t perfect) you can let paprica know about this by modifying the user_ec.csv file in ref_genome_database/user.  Open the file in a text editor and you should see this:

##    This file can be used to indicate enzymes by EC number that you
##    believe should be in a genome, but do not appear in the genome.
##    These EC numbers will propagate to internal nodes, but will not
##    be used in pathway prediction.  The first row of this file is a
##    header and should not be modified.  Rows should be numbered
##    sequentially.
##
##    Example rows:
##    1,GCF12345,1.1.1.1,happy enzyme
##    2,GCF54321,2.3.4.5,sad enzyme
##
,GI_number,EC_number,product

Follow the instructions to add new enzymes to your database.  These enzymes will propagate to internal nodes every time you update your database, but they will not be used in pathway prediction.

Build the Database

If you tested paprica-build.sh as described above (you should!) make sure you switch the -download flag back to T.  This will tell paprica to initiate a fresh download of genomes from Genbank.  Don’t worry about the ref_genome_database directory you used for testing, it is fully compatible with the new download.  Once you’ve done this initiate a build of the real database:

./paprica-build.sh bacteria

Of course if you wanted to build the archaeal database you would replace “bacteria” with “archaea”.

Cautionary Notes

One of the particularly time consuming steps in paprica-build.sh is pathway prediction by pathway-tools.  In addition to taking some time (it has a lot of work to do) pathway-tools needs to send you graphical messages.  You can ignore these, but if you close the SSH session progress will stop because pathway-tools has no place to send the messages.


Correctly evaluating metabolic inference methods

Last week I gave a talk at the biennial Ocean Sciences Meeting that included some results from analysis with paprica.  Since paprica is a relatively new method I showed the below figure to demonstrate that paprica works.  The figure shows a strong correlation for four metagenomes between observed enzyme abundance and enzyme abundance predicted with paprica (from 16S rRNA gene reads extracted from the metagenome).  This is similar to the approach used to validate PICRUSt and Tax4Fun.

Spearman’s correlation between predicted and observed enzyme abundance in four marine metagenomes.

The correlation looks decent, right?  It’s not perfect, but most enzymes are being predicted at close to their observed abundance (excepting the green points where enzyme abundance is over-predicted because metagenome coverage is lower).

After the talk I was approached by a well known microbial ecologist who suggested that I compare these correlations to correlations with a random collection of enzymes.  His concern was that because many enzymes (or genes, or metabolic pathways) are widely shared across genomes any random collection of genomes looks sort of like a metagenome.  I gave this a shot and here are the results for one of the metagenomes used in the figure above.

Correlation between predicted and observed (red) and random and observed (black) enzyme abundances.


Uh oh.  The correlation is better for predicted than random enzyme abundance, but rho = 0.7 is a really good correlation for the random dataset!  If you think about it however, this makes sense.  For this test I generated the random dataset by randomly selecting genomes from the paprica database until the total number of enzymes equaled the number predicted for the metagenome.  Because there are only 2,468 genomes in the current paprica database (fewer than the total number of completed genomes because only one genome is used for each unique 16S rRNA gene sequence) the database gets pretty well sampled during random selection.  As a result rare enzymes (which are also usually rare in the metagenome) are rare in the random sample, and common enzymes (also typically common in the metagenome) are common.  So random ends up looking a lot like observed.
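
For anyone who wants to try this at home, here’s a minimal Python sketch of the randomization.  The object names are hypothetical stand-ins (genome_ec is a genomes x enzymes count table, predicted and observed are enzyme abundance vectors for the metagenome); paprica doesn’t produce objects with these names, and details such as sampling with replacement may differ from what I actually did:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def random_enzyme_profile(genome_ec, n_target, seed=None):
    # draw genomes at random until the summed enzyme count reaches n_target
    rng = np.random.default_rng(seed)
    profile = pd.Series(0, index=genome_ec.columns)
    while profile.sum() < n_target:
        profile = profile.add(genome_ec.iloc[rng.integers(len(genome_ec))], fill_value=0)
    return profile

# compare the predicted and random profiles against the observed metagenome
# random_profile = random_enzyme_profile(genome_ec, n_target=int(predicted.sum()))
# rho_pred, _ = spearmanr(predicted.reindex(observed.index, fill_value=0), observed)
# rho_rand, _ = spearmanr(random_profile.reindex(observed.index, fill_value=0), observed)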

It was further suggested that I try and remove core enzymes for this kind of test.  Here are the results for different definitions of “core”, ranging from enzymes that appear in less than 100 % of genomes (i.e. all enzymes, since no EC numbers appeared in all genomes) to those that appear in less than 1 % of genomes.

The difference between the random and predicted correlations does change as the definition of the core group of enzymes changes.  Here’s the data aggregated for all four metagenomes in the form of a sad little Excel plot (error bars give standard deviation).

This suggests to me a couple of things.  First, although I was initially surprised at the high correlation between a random and observed set of enzymes, I’m heartened that paprica consistently does better.  There’s plenty of room for improvement (and each new build of the database does improve as additional genomes are completed – the last build added 78 new genomes, see the current development version) but the method does work.  Second, we obtain maximum “sensitivity”, defined as improvement over the random correlation, for enzymes that are present in fewer than 10 % of the genomes in the database.  Above that the correlation is inflated (but not invalidated) by common enzymes; below that we start to lose predictive power.  This can be seen in the sharp drop in the predicted-random rho (Δrho: is it bad form to mix greek letters with the English version of same?) for enzymes present in less than 1 % of genomes.  Because lots of interesting enzymes are not very common, this is where we have to focus our future efforts.  As I mentioned earlier some improvement in this area is automatic; each newly completed genome improves our resolution.
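
Continuing the sketch above (same hypothetical object names), filtering out “core” enzymes before computing Δrho looks something like this:

from scipy.stats import spearmanr

def delta_rho(genome_ec, predicted, random_profile, observed, max_prevalence):
    # keep only enzymes present in fewer than max_prevalence of genomes
    prevalence = (genome_ec > 0).mean()
    keep = prevalence.index[prevalence < max_prevalence]
    obs = observed.reindex(keep, fill_value=0)
    rho_pred, _ = spearmanr(predicted.reindex(keep, fill_value=0), obs)
    rho_rand, _ = spearmanr(random_profile.reindex(keep, fill_value=0), obs)
    return rho_pred - rho_rand

# evaluate across a range of "core" definitions
# for cutoff in [1.0, 0.5, 0.1, 0.05, 0.01]:
#     print(cutoff, delta_rho(genome_ec, predicted, random_profile, observed, cutoff))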

Some additional thoughts on this.  There are parameters in paprica that might improve Δrho.  The contents of closest estimated genomes are determined by a cutoff value – the fraction of descendant genomes a pathway or enzyme appears in.  I redid the Δrho calculations for different cutoff values, ranging from 0.9 to 0.1.  Surprisingly this had only a minor impact on Δrho.  The reason for this is that most of the 16S reads extracted from the metagenomes placed to closest completed genomes (for which cutoff is meaningless) rather than closest estimated genomes.  An additional consideration is that I did all of these calculations for enzyme predictions/observations instead of metabolic pathways.  The reason for this is that predicting metabolic pathways on metagenomes is rather complicated (but doable).  Pathways have the advantage of being more conserved than enzymes however, so I expect to see an improved Δrho when I get around to redoing these calculations with pathways.

Something else that’s bugging me a bit… metagenomes aren’t sets of randomly distributed genomes.  Bacterial community structure is usually logarithmic, with a few dominant taxa and a long tail of rare taxa.  The metabolic inference methods by their nature capture this distribution.  A more interesting test might be to create a logarithmically distributed random population of genomes, but this adds all kinds of additional complexities.  Chief among them being the need to create many random datasets with different (randomly selected) dominant taxa.  That seems entirely too cumbersome for this purpose…

So to summarize…

  1.  Metabolic inference definitively outperforms random selection.  This is good, but I’d like the difference (Δrho) to be larger than it is.
  2. It is not adequate to validate a metabolic inference technique using correlation with a metagenome alone.  The improvement over a randomly generated dataset should be used instead.
  3. paprica, and probably other metabolic inference techniques, has poor predictive power for rare (i.e. very taxonomically constrained) enzymes/pathways.  This shouldn’t surprise anyone.
  4. Alternate validation techniques might be more useful than correlating with the abundance of enzymes/pathways in metagenomes.  Alternatives include correlating the distance in metabolic structure between samples with distance in community structure, as we did in this paper, or correlating predictions for draft genomes.  In that case it would be necessary to generate a distribution of correlation values for the draft genome against the paprica (or other method’s) database, and see where the correlation for the inferred metabolism falls in that distribution.  Because the contents of a draft genome are a little more constrained than the contents of a metagenome I think I’m going to spend some time working on this approach…

Tutorial: Analysis with paprica

Paprika. Not to be confused with paprica.

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Getting Started

I’ve been making a lot of improvements to paprica, our program for conducting metabolic inference on 16S rRNA gene sequence libraries.  The following is a complete analysis example with paprica to clarify the steps described in the wiki, and to highlight some of the recent improvements to the method.  I’ll continue to update this tutorial as the method evolves.  This tutorial assumes that you have all the dependencies for paprica-run.sh installed and in your PATH.  If you’re a Mac user you can follow the instructions here.  Linux (including Windows Subsystem for Linux) users should use this script as a guide.  This tutorial has been re-written for the most recent version of paprica, and does not directly apply to v0.6 or earlier.

Although not required for this tutorial, I recommend that you have R installed (probably with RStudio) for downstream analysis, and the Archaeopteryx tree viewer.  Follow the appropriate instructions for your platform for R and RStudio.  Archaeopteryx is a little more convoluted; after first installing Java, install Archaeopteryx as such:

## Download the jar file, note that you might need to update this link to reflect the current version.  Visit https://sites.google.com/site/cmzmasek/home/software/archaeopteryx to check.
wget http://www.phyloxml.org/download/forester/forester_1050.jar
mv forester_1050.jar archaeopteryx.jar
## Download the configuration file and rename to something that works
wget http://www.phyloxml.org/download/forester/archaeopteryx/_aptx_configuration_file.txt
mv _aptx_configuration_file.txt aptx_configuration_file

Double click on archaeopteryx.jar to start the program.  Finally, this tutorial assumes that you are using the provided database of metabolic pathways and genome data included in the ref_genome_database directory.  If you want to build a custom database you should follow this tutorial.  All the dependencies are installed and tested?  Before we start let’s get familiar with some terminology.

closest completed genome (CCG): One type of edge: the most closely taxonomically related completed genome to a query read.  This term is only applicable for query reads that place to a terminal edge (branch tip) on the reference tree.

closest estimated genome (CEG): Another type of edge: the set of genes that are estimated to be found  in all members of a clade originating at a branch point in the reference tree.  This term is only applicable to query reads that place to an internal edge on the reference tree.  The CEG is the point of placement.

community structure: The taxonomic structure of a bacterial assemblage.

edge: An edge is a point of placement on a reference tree.  Think of it as a branch of the reference tree.  Edges take the place of OTUs in this workflow, and are ultimately far more interesting and informative than OTUs.  Refer to the pplacer documentation for more.

unique read: A read from a dataset that has been denoised using (e.g.) dada2.

metabolic structure: The abundance of different metabolic pathways within a bacterial assemblage.

reference tree: This is the tree of representative 16S rRNA gene sequences from all completed Bacterial genomes in Genbank.  The topology of the tree defines what pathways are predicted for internal branch points.

Overview

The basic steps that paprica takes during analysis are:

  1. Identify the reads that belong to the specified domain (bacteria, archaea, eukarya).
  2. Place the reads on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes (Archaea) or representatives from all phyla present in RefSeq genomes (Bacteria).  For bacteria, reads that place to a phylum (i.e. not an internal node) are then placed on a 16S + 23S rRNA gene tree comprised of all completed genomes in RefSeq genomes belonging to that phylum.  The domain Eukarya follows a similar procedure as for Bacteria, except that an 18S rRNA gene tree is used and division is used rather than phylum.  The 18S rRNA gene sequences come from the PR2 database rather than RefSeq.
  3. For Bacteria and Archaea only, map the enzymes, pathways, and genome parameters present at each point of placement to the query reads.
  4. Use the points of placement to provide a consensus taxonomy for each query read.

Testing the Installation

You can test your installation of paprica and its dependencies using the provided test.bacteria.fasta or test.archaea.fasta files.  For test.bacteria.fasta, from the paprica directory:

./paprica-run.sh test bacteria

The first argument specifies the name of the input fasta file (test.bacteria.fasta) without the extension.  The second argument specifies which domain you are analyzing for.  Executing the command produces a variety of output files in the paprica directory:

ls test*
temp.bacteria
test.archaea.fasta
test.archaeal16S.reads.txt
test.bacteria.clean.fasta
test.bacteria.clean.unique.align.sto
test.bacteria.clean.unique.count
test.bacteria.clean.unique.fasta
test.bacteria.combined_16S.bacteria.tax.placements.csv
test.bacteria.ec.csv
test.bacteria.edge_data.csv
test.bacteria.fasta
test.bacteria.pathways.csv
test.bacteria.sample_data.txt
test.bacteria.sum_ec.csv
test.bacteria.sum_pathways.csv
test.bacteria.unique_seqs.csv
test.bacterial16S.reads.txt
test.eukarya.fasta
test.eukaryote18S.reads.txt
test.fasta
test.unique.count
test.unique.fasta

Each sample fasta file that you run will produce similar output.  These files are described in detail in the Wiki here, with the following being particularly useful to you:

test.bacteria: Because multiple jplace and other intermediate files are created for the domains Bacteria and Eukarya the number of files quickly gets out of control.  Some of these files are useful for downstream analysis and troubleshooting, however, so they are retained in a directory with the sample name.

test.bacteria.combined_16S.bacteria.tax.placements.csv: The *placements.csv file represents a summary of the placement data for each unique query read and is the result of parsing the jplace file(s) produced by epa-ng.  For Bacteria and Eukarya, which have multiple reference trees, this file contains the phylogenetic placement results for all the reference trees for a given sample.

test.bacteria.bacteria.edge_data.csv: This is a csv format file containing data on edge location in the reference tree that received a placement, such as the number of reads that placed, predicted 16S rRNA gene copies, number of reads placed normalized to 16S rRNA gene copies, GC content, etc.  This file describes the taxonomic structure of your sample.

test.bacteria.unique_seqs.csv: This is a csv file that contains the abundance and normalized abundance of unique sequences. Each sequence is identified by a unique hash (to allow tallying across multiple samples) and the name of a representative read is also provided.

test.bacteria.bacteria.sum_pathways.csv: This is a csv file of all the metabolic pathways inferred for test.bacteria.fasta, by edge.  All possible metabolic pathways are listed (meaning all metabolic pathways that are represented in the database), the number attributed to each edge is given in the column for that edge.

test.bacteria.bacteria.sum_ec.csv: This is a csv file of all the enzymes with EC numbers inferred for test.bacteria.fasta, by edge. The file is structured the same as test.bacteria.bacteria.sum_pathways.csv.

test.bacteria.bacteria.sample_data.txt: This file describes some basic information for the sample, such as the database version that was used to make the metabolic inference, the confidence score, total reads used, etc.

test.bacteria.bacteria.sum_pathways.csv: This csv format file describes the metabolic structure of the sample, i.e. pathway abundance across all edges.

If you want to run an analysis with archaeal 16S rRNA or eukaryotic 18S rRNA gene reads you can test the same way.  Note that there is no metabolic inference for the domain eukarya, but the reads are placed on an extensive set of reference trees that are useful for classification and phylogenetic analysis.

./paprica-run.sh test archaea
./paprica-run.sh test eukarya

Conducting an Analysis – read QC

Okay, that was all well and good for the test.fasta file, which has placements only for a single edge and is not particularly exciting.  Let’s try something a bit more advanced that mimics what you would do for an actual analysis.  First we need to get some data.  In this case we’ll use a set of samples from a recent publication, Webb et al. 2019.  You can acquire these data from the NCBI SRA site using the SRA toolkit.  First, create a file titled SRA_select_Run.txt, then populate the file with these accession numbers:

SRX4496910
SRX4496911
SRX4496912
SRX4496913
SRX4496914
SRX4496915
SRX4496916
SRX4496917
SRX4496918
SRX4496919
SRX4496920
SRX4496921
SRX4496922
SRX4496923
SRX4496924
SRX4496925
SRX4496926
SRX4496927
SRX4496928
SRX4496929
SRX4496930
SRX4496931
SRX4496932
SRX4496933
SRX4496934
SRX4496935
SRX4496936
SRX4496937
SRX4496938
SRX4496939
SRX4496940
SRX4496941
SRX4496942
SRX4496943
SRX4496944
SRX4496945
SRX4496946
SRX4496947
SRX4496948
SRX4496949
SRX4496950
SRX4496951
SRX4496952
SRX4496953
SRX4496954

Then execute the following command (if you don’t have it already, install GNU parallel with your package manager of choice):

parallel fastq-dump --split-files --skip-technical {} < SRA_select_Run.txt

Now you’ll want to QC, denoise, and merge the reads.  I’m not going to cover that here, but strongly recommend using dada2.  You can use this R script, which is a modification of the tutorial on the dada2 site.  Merged reads should be inflated to redundant fasta files before analysis with paprica.  You can use our deunique_dada2.py script for that.

Conducting an Analysis – running paprica

Previously you executed paprica on just a test file (test.fasta).  Here we have a larger number of samples, and we need to construct a loop to run paprica on all of them.  First, copy the paprica/paprica-run.sh file into your working directory:

cp ~/paprica/paprica-run.sh paprica-run.sh

Paprica is pretty fast, and the key parts are already parallelized, so it doesn’t make sense to run the samples in parallel.  We can loop across multiple input files like this, specifying the file to run without the “.fasta” extension:

for f in *exp.fasta;do NAME=$(basename $f .fasta); ./paprica-run.sh $NAME bacteria; done

At this point you have individual analysis files for all your samples.  These can be quite useful, but most useful are classic abundance tables for edges, unique reads, enzymes, and pathways.  You can create these using the paprica-combine_results.py script.  This used to be a separate utility but is now a core part of paprica.  That means the script should reside in your paprica directory, and you call it from your working directory:

paprica-combine_results.py -domain bacteria -o 2017.07.03_seagrass

This produces several files.  Note that the prefix in the file names is set by the -o flag in the command above, and reflects the nature of these example reads:

2017.07.03_seagrass_bacteria.taxon_map.txt: This file maps edge numbers to the name of the lowest consensus taxonomy (for CEGs) or strain name (for CCGs).

2017.07.03_seagrass_bacteria.seq_edge_map.txt: This file maps unique sequences to the edge they most frequently place to across samples.  The vast majority of reads will place to only a single edge; however, the more samples you have and the more abundant a read is, the more likely it is that it will place to more than one edge.

2017.07.03_seagrass_bacteria.edge_data.csv: These are the mean genome parameters for each sample.  Lots of good stuff in here, see the column labels.

2017.07.03_seagrass_bacteria.edge_tally.csv: Edge abundance for each sample (corrected for 16S rRNA gene copy).  This is your community structure, and is equivalent to an OTU table (but much better!).

2017.07.03_seagrass_bacteria.unique_tally.csv: The abundance and normalized abundance of unique sequences.

2017.07.03_seagrass_bacteria.ec_tally.csv: Enzyme abundance for each sample.

2017.07.03_seagrass_bacteria.path_tally.csv: Metabolic pathway abundance for each sample.

To get familiar with some basic operations on the output files, including exploring the data with heatmaps and correspondence analysis, you can continue with this tutorial.
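
If you’d rather poke at the output in Python before moving on (the tutorial linked above uses R), a minimal sketch might look like the following; it assumes samples are rows and edges are columns in the edge tally file, so check the orientation of your own file first:

import pandas as pd
import matplotlib.pyplot as plt

## read the edge abundance table produced by paprica-combine_results.py
edges = pd.read_csv('2017.07.03_seagrass_bacteria.edge_tally.csv', index_col=0).fillna(0)

## convert to relative abundance and keep the 25 most abundant edges
rel = edges.div(edges.sum(axis=1), axis=0)
top = rel[rel.mean().sort_values(ascending=False).index[:25]]

plt.figure(figsize=(10, 6))
plt.imshow(top, aspect='auto', cmap='viridis')
plt.colorbar(label='relative abundance')
plt.yticks(range(len(top.index)), top.index)
plt.xticks(range(len(top.columns)), top.columns, rotation=90)
plt.tight_layout()
plt.savefig('edge_heatmap.png')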


Tutorial: Installing paprica on Mac OSX

The following is a paprica installation tutorial for novice users on Mac OSX (installation on Linux is quite a bit simpler). If you’re comfortable editing your PATH and installing things using bash you probably don’t need to follow this tutorial; just get the dependencies as indicated in the install instructions and linux_install.sh script. If command line operations stress you out, and you haven’t dealt with a lot of weird bioinformatics program installs, use this tutorial.

Please note that this tutorial is a work in progress.  If you notice errors, inconsistencies, or omissions please leave a comment and I’ll be sure to correct them.  This tutorial has been updated for the most recent version of paprica, and will not work for v0.6 or earlier.

** IMPORTANT ** It is generally considered very poor practice to install anything in the root directory.  You might think, “but I’m the only user, so this makes more sense” or “but everyone in the lab wants program X, so I should install as root.”  Don’t do it.  Install to your home directory.  It will add years to your life.

This tutorial assumes you’ve followed this advice, and that you are installing all the dependencies in your home directory.

Install Python and Python packages

paprica is 90 % an elaborate set of wrapper scripts for several core programs written by other groups. The scripts that execute the pipeline are bash scripts; the scripts that do the actual work are Python. Therefore you need Python up and running on your system. If you already have a mainstream v3.0 Python distro going, just make sure that the modules listed below are installed (e.g., with conda install [package] and not pip3).  Note that Python 3 must be callable on your system as “python3”, which should be the default.

Install some necessary Python modules, assuming you don’t already have them:

pip3 install numpy
pip3 install biopython
pip3 install joblib
pip3 install pandas
pip3 install seqmagick
pip3 install termcolor

In case you have conflicts with other Python installations, or some other mysterious problems, it’s a good idea to test things out at this point. Open a shell, type “python3”  and:

import numpy
import Bio
import joblib
import pandas
import termcolor

If you get any error messages something somewhere is wrong. Burn some incense and try again. If that doesn’t work try holy water.

Seqmagick is a standalone program, not a module, so check the installation by typing:

seqmagick

You should get a sensible error that is clearly seqmagick yelling at you and not your computer trying to find seqmagick.

Install Homebrew and wget

Older versions of paprica needed the programs pplacer and gappa, which had dependencies that could only be acquired for OSX via a package manager such as Homebrew.  These dependencies are no longer needed for paprica, but I’ve left the Homebrew step in here because if you’re doing anything sciency with your computer you probably want a package manager, and I find wget to be a much more useful file fetching utility than curl.

To download Homebrew (assuming you don’t already have it) type:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Follow the on-screen instructions.

Now install wget.

brew install wget

Install Infernal, gappa, and epa-ng

Assuming all that went okay go ahead and download the software you need to execute just the paprica-run.sh portion of paprica. First, the excellent aligner Infernal. From your home directory:

wget http://eddylab.org/infernal/infernal-1.1.1-macosx-intel.tar.gz
tar -xzvf infernal-1.1.1-macosx-intel.tar.gz
mv infernal-1.1.1-macosx-intel infernal

Then gappa:

To install gappa you need make and cmake, and a fairly up-to-date C++11 compiler.

brew install make
brew install cmake

If you try to do the compilation with AppleClang on Mac OSX it will probably fail.  You need OpenMP, but it is a known issue that AppleClang on Mac OSX does not work well with OpenMP.  Here are two alternatives based on a discussion on the gappa page, which you can read about here.

  1. Install gappa via conda, instead of compiling on your own (it does not need OpenMP): https://anaconda.org/bioconda/gappa
  2. Instead of AppleClang, use a “proper” clang:
brew install llvm libomp

You’ll need to set custom paths so that the new clang is used:

#Example on how to do this:
export PATH="$(brew --prefix llvm)/bin:$PATH"
export COMPILER=/usr/local/opt/llvm/bin/clang++
export CFLAGS="-I /usr/local/include -I/usr/local/opt/llvm/include"
export CXXFLAGS="-I /usr/local/include -I/usr/local/opt/llvm/include"
export LDFLAGS="-L /usr/local/lib -L/usr/local/opt/llvm/lib"
export CXX=${COMPILER}

#Not needed for all MacOS versions
#If you get this error: "ld: unknown option: -platform_version"
#You'll need to add:
export CXXFLAGS="${CXXFLAGS} -mlinker-version=450"
export LDFLAGS="${LDFLAGS} -mlinker-version=450"

And now you should be ready to install gappa:

git clone --recursive https://github.com/lczech/gappa.git
cd gappa
make
cd ~

And finally, epa-ng:

brew install brewsci/bio/epa-ng

Add dependencies to PATH

Now comes the tricky bit, you need to add the locations of the executables for these programs to your PATH variable.  This is a pretty important basic computing skill to master.  Try not to screw it up.  It isn’t hard to undo screw-ups, but it will freak you out because bash will suddenly be unable to find programs that it could find before. Before you continue please read the excellent summary of shell startup scripts as they pertain to OSX here:

http://hayne.net/MacDev/Notes/unixFAQ.html#shellStartup

This tutorial attempts to provide a broad solution to shell startup scripts by sourcing .profile and .bash_profile in .bashrc.  I recommend you then only modify .bashrc, though this is not strictly necessary.

## Open .bashrc for editing.

nano .bashrc

At the top of the file type:

source .bash_profile
source .profile

Now navigate to the end of the file and paste the following, modifying as necessary (note: there are lots of syntactic variations for adding a location to PATH, the below commands are a little redundant but clear and easy to modify):

export PATH=/Users/your-user-name/infernal/binaries:${PATH}
export PATH=/Users/your-user-name/infernal/easel:${PATH}
export PATH=/Users/your-user-name/pplacer:${PATH}
export PATH=/Users/your-user-name/epa-ng/bin:${PATH}
export PATH=/Users/your-user-name/paprica:${PATH}
export PATH=/Users/your-user-name/gappa/bin:${PATH}

Don’t be the guy or gal who types your-user-name. Replace with your actual user name. Hit ctrl-o to write out the file, enter to save, and ctrl-x to exit nano.

Re-source .bashrc by typing:

source .bashrc

Confirm that you can execute the following programs by navigating to your home directory and executing each of the following commands:

cmalign
esl-alimerge
gappa
epa-ng

You should get an error message that is clearly from the program, not a bash error like “command not found”.

Get paprica

Okay, now you are ready to get paprica and do some analysis! You can clone the latest repository here:

git clone https://github.com/bowmanjeffs/paprica.git

Now make paprica-run.sh and python scripts executable.

cd paprica 
chmod a+x paprica-run.sh 
chmod a+x *py

At this point you should be ready to rock. Take a deep breath and type:

./paprica-run.sh test bacteria

This analyzes the file test.fasta against the bacteria database.  You should see a lot of output flash by on the screen, and you should see a number of new files in the directory with the prefix “test.”  Check out the paprica analysis tutorial and manual for more info on these files.

To run your own analysis, say on amazing_sample.fasta against the bacteria database, simply type:

./paprica-run.sh amazing_sample bacteria

Please, please, please, read the manual for further details. Remember that the fasta file you input should contain only reads that are properly QC’d (i.e. low quality ends and adapters and barcodes and such trimmed away) and denoised (e.g., with dada2).
