Some things should be easy…

******************
Updated method posted here: https://www.polarmicrobes.org/?p=859
******************

But they’re not.  Getting meaningful taxonomy from blast results is one of those things.  Here’s the situation: I ran a blastn search locally against the NCBI nt (nucleotide) database, saving the output in the (somewhat) easy-to-manipulate xml format.  I’m interested in catching as many of the close homologues to my queries as possible, so I allowed up to 500 matches per search.  The resulting xml file is pretty substantial.  What I want to do now is identify the genus for each of my hits.  It would be absolutely fantastic if NCBI would include the taxonomic information in the blast results – it would be easy to do on their end – but they don’t.  C’est la vie.

Solution the first (the ugly)

Since the hit definitions (the hit_def tag in the xml) contain the scientific name of the species of origin, they can be parsed with regular expressions to produce something that is probably, most of the time, the genus name.

import re

with open('big_blast.xml','r') as xml:
    for line in xml:
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = re.sub('<Hit_def>', '', line)
            line = re.sub('</Hit_def>', '', line)  # strip the closing tag as well
            hit_def = line.lower()
            hit_genus = re.split(' ', hit_def)[0]

But this will on occasion produce some weirdness that is not the genus name, and I doubt the exact format of hit_def is stable over time.  It would be nice to tap into the NCBI taxonomy database to get this information right from the source.

Solution the second (the fat)

The NCBI taxonomy database can be acquired from the NCBI ftp site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/.  The large gi_taxid_nucl.dmp.gz file is the key to the NCBI taxonomy safe.  It is a simple tab delimited text file mapping a gi number to a taxonomy id number.  With the taxonomy id number (and a little voodoo) you can obtain a full taxonomy for any gi that has one.  Since the gi number is conveniently embedded in the hit id section of the xml (hit_id tag, pry it out with some more regular expressions) the full taxonomy seems so, so close…
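
Hit ids in the xml have the form gi|number|db|accession|, so a minimal sketch for prying the gi number out of each hit_id line might look something like this (an untested sketch, using the same regular expression approach as above):

import re

with open('big_blast.xml','r') as xml:
    for line in xml:
        if re.search('<Hit_id>', line) != None:
            # hit ids look like gi|<number>|<db>|<accession>|
            hit_gi = re.search('gi\|(\d+)\|', line).group(1)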

A casual user of Python might look at gi_taxid_nucl.dmp (with head, after using gunzip) and think dictionary.  If the whole thing could be crammed into a dictionary with the gi numbers as keys, the taxid could be easily obtained for any given gi number.  Problem is, it can’t be crammed into a dictionary – at least not without a lot of trouble.  Python dictionary creation and operation is memory intensive, and 24 gigs of RAM is not nearly enough for this.  I did manage to create the dictionary on a high memory machine (150 gigs; it failed when I initially set ulimit -v to 100 gigs) by iterating across each line in gi_taxid_nucl.dmp and saving the result as a pickle.  A what, you say?  That’s what I said too.  Turns out that pickles, in addition to being tasty fermented cucumbers, are binary flat files that can store things like Python lists and dictionaries.  Unfortunately the resulting pickle couldn’t be used on the 24 gig machine.  I’m a guest on the high memory cluster and am reluctant to hog it for what seems like a simple operation, so I needed a different solution…
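
For the record, the dictionary-and-pickle attempt looked more or less like this (a sketch; as described above it needs a prohibitive amount of RAM, and the pickle file name is arbitrary):

import cPickle as pickle

gi_taxid = {}
with open('gi_taxid_nucl.dmp', 'r') as map_file:
    for line in map_file:
        line = line.split()
        gi_taxid[line[0]] = line[1]   # gi number -> taxid

with open('gi_taxid_nucl.p', 'wb') as pickle_out:
    pickle.dump(gi_taxid, pickle_out)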

Solution the third (the slow)

I don’t have a lot of experience with SQL databases, but I know that this is the intended environment for the NCBI taxonomy.  The SQL workhorse MySQL looked like overkill, so I went with sqlite.  Python has a nice interface for working with sqlite databases (the sqlite3 module), so this seemed like a natural way to go.  I created a database for gi_taxid_nucl.dmp as such:

import sqlite3
conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
c.execute('''CREATE TABLE gi_taxid
(gi, taxid)''')

with open('gi_taxid_nucl.dmp', 'r') as map_file:
    for line in map_file:
        line = line.split()
        print line[0], line[1]
        # let sqlite3 substitute the values rather than building the SQL string by hand
        c.execute('INSERT INTO gi_taxid VALUES (?,?)', (int(line[0]), int(line[1])))

conn.commit()
conn.close()

The SQL database isn’t much larger than the original tab-delimited file, and didn’t take too much memory to build. It completed in a couple of hours. To get data back out of the database I tried:

import sqlite3
conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
t = (123,)
for row in c.execute('SELECT * FROM gi_taxid WHERE gi=?', t):
	print row

That should pull out the row corresponding to the gi number 123.  And it does.  But it takes a long time.  I didn’t time it, but I’m guessing it would take a few weeks to run tens of thousands of queries.  I’m not sure if there is a faster way to work with sqlite.  It feels like an SQL database should be the solution, but it hasn’t worked out yet…
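
In hindsight the slow queries are probably down to the lack of an index on the gi column – without one every query has to scan the entire table.  Something like the sketch below (run once, after the table is built) would likely speed the lookups up dramatically, though I haven’t gone back and timed it:

import sqlite3

conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
# an index lets sqlite jump straight to the matching row instead of scanning the whole table
c.execute('CREATE INDEX IF NOT EXISTS gi_index ON gi_taxid (gi)')
conn.commit()
conn.close()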

Solution the fourth (the least ugly, fat, and slow)

The fastest way to parse gi_taxid_nucl.dmp (that I can come up with) is with grep.  It takes grep only a second or two to match a gi number to a line in gi_taxid_nucl.dmp, which is pretty amazing.  My final script is below.  I iterate across a list of gi numbers obtained from my blast output (reformatted into a table, see this post https://www.polarmicrobes.org/?p=753), using grep to match each to a line in gi_taxid_nucl.dmp.  From these lines I can create a nice, reasonably sized dictionary from which I can quickly obtain the taxid when I need it.  The part of the script that actually generates the taxonomy comes from this post: https://groups.google.com/forum/#!msg/qiime-forum/PXem7gMltao/oKOxJ_zFD4cJ.

from cogent.parse.ncbi_taxonomy import NcbiTaxonomyFromFiles
import subprocess

subprocess.call('rm blast_taxid.txt', shell = True)

print 'building dictionary'

n = 0
with open('blast_output_table.txt', 'r') as blast_table:
    for each in blast_table:
        n = n + 1
        print n
        each = each.split('\t')
        # anchor the pattern to the start of the line so that gi 123 can't match 1234, or a taxid
        subprocess.call('grep --max-count=1 "^'+each[3]+'[[:space:]]" gi_taxid_nucl.dmp | tee -a blast_taxid.txt', shell = True)

map_dict = {}           
with open('blast_taxid.txt') as gi_taxid:
    for each in gi_taxid:
        each = each.split()
        gi = each[0]
        taxid = each[1]
        map_dict[gi] = taxid
        print gi, map_dict[gi]

print 'reading in files'

tree = NcbiTaxonomyFromFiles(open('nodes.dmp'), open('names.dmp'))
root = tree.Root

print 'generating lineages'

def get_lineage(node, my_ranks):
    ranks_lookup = dict([(r,idx) for idx,r in enumerate(my_ranks)])
    lineage = [None] * len(my_ranks)
    curr = node
    while curr.Parent is not None:
        if curr.Rank in ranks_lookup:
            lineage[ranks_lookup[curr.Rank]] = curr.Name
        curr = curr.Parent
    return lineage

ranks = ['genus']
output = open('blast_lineage.txt', 'w')
error = open('error.txt','w')
with open('blast_output_table.txt', 'r') as input:
    for line in input:
        if line[0:5] != 'query':
            try:
                line = line.split('\t')
                taxid = int(map_dict[line[3]])
                node = tree.ById[taxid]
                tax = get_lineage(node, ranks)
                tax = str(tax[0]).lower()
                print >> output, line[0]+'\t'+tax+'\t'+line[2]+'\t'+line[5],
                print line[0], tax.lower()
            except KeyError:
                # gi numbers missing from the dictionary get logged to error.txt
                print >> error, line[3]
                print 'key error', line[3]

output.close()
error.close()

I watched a show last night featuring researchers who get to dart polar bears from a helicopter, then land and give them medical examinations.  It looked like fun, I wonder if they want to switch dissertation topics…

 


Parsing blast xml output

In the previous post I described a method for efficiently blasting a medium-large dataset (33 million sequences).  One downside to using blast on a dataset this size is the very large size of the output xml file, in this case over 40 gigabytes.  The size of the output can be reduced by specifying the table output format (-outfmt 6) when executing blast, but this greatly reduces the amount of information you receive from the run.  If you are going to the trouble of running blast on 33 million sequences you may as well retain all the information possible from each alignment, reducing the chance of needing to blast the dataset again!  The xml format (-outfmt 5) is a (reasonably) user friendly and rich blast output format.  If you are new to bioinformatics, however, it can be a little daunting – the critical information (best hit, e value, etc.) can look pretty buried in a mysterious jumble of tabs and tags.  Here’s a chunk of the aforementioned xml output, corresponding to a single search:

  <BlastOutput_iterations>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147</Iteration_query-def>
      <Iteration_query-len>70</Iteration_query-len>
      <Iteration_hits>
      <Iteration_stat>
          <Statistics_db-num>17771239</Statistics_db-num>
          <Statistics_db-len>44780205193</Statistics_db-len>
          <Statistics_hsp-len>30</Statistics_hsp-len>
          <Statistics_eff-space>1769882720920</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
      <Iteration_message>No hits found</Iteration_message>

What we have here is some kind of nested format for storing information, identified by tags (<some_tag>), which is all that xml is.  Subordinate pieces of information are more indented than their parents, so in this example all the information belongs to BlastOutput_iterations.  If you are familiar with blast you will recognize some of the tags.  Iteration_query-def, for example, gives the name of the query (the cumbersome read id HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147).  The tag Iteration_message is self explanatory: it tells us that this iteration returned no significant matches in the database.  Had there been a match, additional tags detailing the match (or matches) would have been included.

Fortunately this use of tags makes it pretty easy to parse blast xml output using regular expressions.  I wrote a short script to convert my massive xml output into a less massive (but still a little cumbersome) tab delimited file from which I can extract specific lines corresponding to a hit using grep.  The basic strategy for scripting xml is to create a variable for each tag of interest and march line by line through the xml, re-saving the variable anytime you encounter the corresponding tag of interest.  When the outermost (least subordinate) tag of interest is encountered (or the last in a known sequence of tags is encountered, as done here), print each variable to a line in an output file.

import re

output = open('test_blast.txt','w')
n = 0
print >> output, 'read'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e'

with open('blast.xml','r') as xml:
    for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            n = n + 1
            line = line.strip()
            # re.sub removes the whole tag; str.strip('<tag>') would also eat any
            # matching characters off the ends of the value itself
            line = re.sub('<Iteration_query-def>', '', line)
            line = re.sub('</Iteration_query-def>', '', line)
            query_def = line
        if re.search('No hits found', line) != None:
            line = line.strip()
            line = re.sub('<Iteration_message>', '', line)
            line = re.sub('</Iteration_message>', '', line)
            print n
            print >> output, query_def+'\t'+line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = re.sub('<Hit_def>', '', line)
            line = re.sub('</Hit_def>', '', line)
            hit_def = line
        if re.search('<Hit_accession>', line) != None:
            line = line.strip()
            line = re.sub('<Hit_accession>', '', line)
            line = re.sub('</Hit_accession>', '', line)
            hit_acc = line
        if re.search('<Hsp_evalue>', line) != None:
            line = line.strip()
            line = re.sub('<Hsp_evalue>', '', line)
            line = re.sub('</Hsp_evalue>', '', line)
            e_val = line
            print n
            print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val

output.close()

If, having successfully parsed the xml file, I want to access all the reads from my metagenome with a hit containing “nifH” in the definition I can extract that information relatively quickly from my table using:

grep 'nifH' test_blast.txt

Since the information that grep returns will include the query id, a new fasta containing the reads with a blast hit of “nifH” is just a short script away (see the sketch below).  One tip… since most of the reads will not have a blast hit, you can create a new, smaller table containing only the reads with hits by grabbing the lines that contain an e value.

 grep 'e-' test_blast.txt > small_test_blast.txt
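
And the short script for turning those nifH hits into a new fasta might look something like this – a sketch, assuming the grep output was saved to nifh_hits.txt and the original reads live in reads.fasta (both file names are made up for the example):

nifh_ids = set()
with open('nifh_hits.txt', 'r') as hits:
    for line in hits:
        # the query id is the first column of the table
        nifh_ids.add(line.split('\t')[0])

keep = False
with open('reads.fasta', 'r') as fasta, open('nifh_reads.fasta', 'w') as output:
    for line in fasta:
        if line.startswith('>'):
            # keep this read (header plus sequence lines) if its id had a nifH hit
            keep = line.strip().lstrip('>').split()[0] in nifh_ids
        if keep:
            output.write(line)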

Mega megablast – conducting a massively parallel blast

The blast algorithm is, with good reason, one of the most widely used tools in bioinformatics.  It allows a reasonably quick comparison between a set of sequences and a (potentially very large) database, or a set of sequences in a fasta file (slower than using a formatted database).  For very small datasets the NCBI maintains an online blast interface; for datasets with hundreds to thousands of sequences it is best to install the blast executable and databases locally.  A wide selection of pre-formatted databases are available on the NCBI ftp site.  But what about using blast for metagenomic analysis?  The relatively small metagenome that I’m working on has 33,000,000 sequence reads, after quality control.  Blast is fast considering the vast number of calculations required to compare all of these reads against even a modestly sized database, but it would still take over a week to run this analysis using blastn (the simplest flavor of blast) against the NCBI nt (nucleotide) database on a single computation node with 12 cpus.  There is a parallel version of blast that makes use of the MPI libraries to split the analysis across multiple nodes, but compiling parallel blast looks a little daunting (I haven’t tried yet).  Fortunately there is a better solution available for users of the UW’s Hyak cluster and similar systems.

On Hyak individual investigators, research groups, or academic units purchase computation nodes for three-year deployments.  Because the computational needs of our group are limited – I’m currently the only one doing bioinformatics – we don’t own any nodes.  Fortunately the College of the Environment owns 11 nodes for trial use by investigators who might purchase nodes of their own.  They also allow these nodes to be used by students in research groups that don’t have a long term need for nodes, or for whom purchasing nodes isn’t economically feasible (thank you COE!).  This means that I have access to eleven 12-processor nodes.  As a freeloader on the system I try to tread lightly on these resources, and mostly succeed.  To blast my metagenome on this system I would need to tie up all the COE nodes for days which, in addition to being rude, would certainly jeopardize my continued use of Hyak.

For this kind of analysis the real power of Hyak comes in access to the backfill queue.  The backfill queue allows users to submit any number of jobs to any idle node in Hyak, whether they own the node or not.  This doesn’t help with memory intensive operations like assembly, but it works great on computationally intensive operations like blast.  The basic idea is to split my fasta file of 33,000,000 reads into smaller files that will each blast in less than four hours (a requirement for use of the backfill queue).  A little testing determines that 330,000 reads will blast, using megablast against the NCBI nt database and all 12 cpus available on a single node, in about an hour and a half.  Great!  I split my fasta using the following command (the script can be found here):

python split_large_multifasta.py 44_trimmed_combined_h20.fasta 100

Then I use a second Python script to create a shell script for each file, and submit that shell script to the backfill queue:

import subprocess
for i in range(0,101):
	i = str(i)
	output = open(i+'.sh', 'w')
	print >> output, '#!/bin/bash'
	print >> output, '#PBS -N "'+i+'"'
	print >> output, '#PBS -d /gscratch/coenv/bowmanjs/barrow_metagenome_working/large_blast'
	print >> output, '#PBS -l nodes=1:ppn=12,feature=12core,walltime=4:00:00'
	print >> output, '#PBS -M bowmanjs@u.washington.edu'
	print >> output, '#PBS -m abe'
	print >> output, 'cp split_fasta/44_trimmed_combined_h20.fasta_'+i+'.fasta /gscratch/coenv/bowmanjs/databases/'
	print >> output, 'cd /gscratch/coenv/bowmanjs/databases'
	print >> output, 'blastn -task megablast -max_target_seqs 5 -evalue 1e-5 -db nt -outfmt 5 -out '+i+'.xml -query 44_trimmed_combined_h20.fasta_'+i+'.fasta -num_threads 12'
	print >> output, 'mv '+i+'.xml ../barrow_metagenome_working/large_blast/'
	print >> output, 'rm 44_trimmed_combined_h20.fasta_'+i+'.fasta'
	output.close()
	subprocess.call('chmod a+x '+i+'.sh', shell = True)
	subprocess.call('qsub -q bf '+i+'.sh', shell = True)

Notice that there’s a funny bit in the above script where the query fasta file is copied to the directory with the database, and then deleted after the analysis.  This is because getting blast to look for the database anywhere other than the working directory is quite difficult (one bad quirk of blast – supposedly you can fix it in your blast profile, but that has never worked for me), so I find it easier just to work in the directory where I stash all my database files and then move everything out.  This isn’t optimal but it works.  At any rate, executing the above script creates and submits one shell script per split fasta file to the backfill queue.  There are some minor delays while the queuing system sorts itself out, and while waiting for idle nodes, but the whole thing is complete in about 4 hours.  Combine all the xml outputs with:

cat *.xml > combined.xml

Now the bad part. At this point you have a behemoth xml file with all your data, 40+ gigabytes in my case. The bits you want can be pulled out with a new script which I’ll detail in a separate post.


At first look at the microbiology of frost flowers

We just published the first analysis of a microbial community inhabiting natural frost flowers in the journal Environmental Microbiology Reports.  Our results are a bit surprising, and I’ll get to them shortly.  First, a brief look at what we did.  Frost flowers have received quite a bit of media attention lately.  If you missed the coverage, frost flowers are delicate crystal structures that are nearly ubiquitous on the surface of newly formed sea ice (so long as the ice forms under a relatively cold atmosphere).  Salt, organic material, and bacteria from the source seawater are concentrated in frost flowers during the process of sea ice growth and maturation.

Although chemical and biological reactions are typically suppressed by low temperatures, the high concentrations of all these organic and inorganic materials, and the presence of ample energy (from sunlight), mean that some ecologically interesting things might be happening here.  For example, check out this paper by a group of atmospheric chemists connected with the OASiS (Ocean-Atmosphere-Sea Ice-Snow) project.  Among other things they report a high concentration of formaldehyde in frost flowers, probably the result of the photolysis (sun-driven breakdown) of larger organic molecules.

Our group is interested in what biological processes might be happening in frost flowers.  To develop testable hypotheses, however, we needed to make some initial observations about the system.  The most basic observation that a microbiologist typically wants to make is an assessment of community composition: what bacteria are present, and their abundance relative to one another.  To do this we relied on an analysis of the 16S rRNA gene, a commonly used taxonomic marker gene for prokaryotes.  By comparing the 16S rRNA gene sequences in frost flowers with those of known bacteria in a database, we can approximate the composition of the frost flower microbial community.

The small lead from which frost flowers and young sea ice were sampled in April, 2010. Not the prettiest picture of frost flowers around, but representative of 90 % of the days in Barrow.

What we found really surprised us; in fact it took several months to wrap our heads around the results.  The heatmap below shows the relative abundance of bacterial genera in our analysis, as determined by one of the four 16S rRNA gene identification methods we used.  Black indicates bacterial genera below detection; the remainder are scaled from white to blue to red (most abundant).  The columns are our samples; there are four frost flower (FF) and four young sea ice (YI) samples.  Several microbial genera, in particular the Methylobacteria, Rhizobium, and Mesorhizobium, are enriched within frost flowers relative to the underlying young sea ice.  These genera are all members of the order Rhizobiales, thus we posit that Rhizobiales are, in general, enriched in these frost flowers.

Relative abundance of genera by microarray analysis of the 16S rRNA gene (Bowman et al. 2013). Genera belonging to the order Rhizobiales tend to be enriched in frost flowers relative to young sea ice.

This is pretty weird.  The Rhizobiales are certainly not unheard of in marine waters, but they are virtually unreported in sea ice and we can’t find any report of them dominating a marine environment.  And they do truly dominate these frost flowers: in a more in-depth analysis of a single frost flower sample we found that 77 % of the 16S rRNA genes classified as Rhizobiales.  The environment where Rhizobiales do typically dominate is (ironically) on the roots of real flowers, or at least on legumes.  There they engage in a mutualistic relationship with the plant, providing fixed nitrogen in exchange for carbohydrates.

This fact might be a clue as to how Rhizobiales could come to dominate the sea ice surface.  Although there isn’t a lot of evidence for algae (aka phytoplankton trapped in the ice) in these frost flowers, there is for the underlying young ice.  Rhizobiales can be found in close associations with phytoplankton, just as they are with plants, so perhaps the Rhizobiales end up in ice because so many phytoplankton do.  Once in the ice, the now stressed phytoplankton and bacteria might end their relationship (enough stress will test even the closest of relationships), leaving the Rhizobiales free to transport to the ice surface during brine rejection (a key element of sea ice growth).

Of course this is just a hypothesis, but hypothesis development was our goal with this study.  We are left in the best possible situation, with an interesting observation and just enough data to develop further questions (okay, the best possible situation would be a definitive answer and a paper in Nature, but still…).  To refine those questions we are continuing to work with these samples, in particular with metagenomes obtained from one of our frost flower and one of our young ice samples.

 


The UAF sea ice course returns!

Another great opportunity was announced at UAF, which seems to be full of them these days.  This is the return of the UAF sea ice course, last taught in 2008.  Here’s the announcement:

*******************************

This spring (May 8-18),  UAF is once again offering a 2cr sea ice field course in Barrow, Alaska (last given in 2008). Instructors will include Hajo Eicken, Rolf Gradinger, Don Perovich, Andy Mahoney, Eric Collins and others. This is a unique class for all students with interest in physical, chemical and biological oceanography as well as marine biology. The course also includes a module on traditional and local knowledge of sea ice. Please pass this on and encourage potential students to apply –

more details are available on the course website:

http://www.sfos.uaf.edu/research/seaicebiota/barrowcourse_2013/

******************************

The sea ice course is an excellent opportunity for any advanced undergraduate or graduate student (or postdoc, or non-degree seeking interested party for that matter) to get some hands-on (and feet-on) experience with sea ice, under the tutelage of some of the world’s most experienced researchers.  The instructor lineup is pretty impressive; Don Perovich, Hajo Eicken, and Rolf Gradinger have been in this business a long time and have contributed very fundamental knowledge.  Eric Collins and Andy Mahoney represent the new generation of researchers, and both have made impressive contributions studying sea ice under challenging winter conditions.  I didn’t start graduate school until after the last sea ice course, so I learned/am learning how to work on and with sea ice by trial and error.  My guess is that many hours and frozen fingers could probably be avoided with this kind of formal training!

Don’t like the cold?  Not to worry… May is balmy in Barrow, and there should be 24 hours of daylight by the time the course starts.  Think spring skiing conditions.  Like Mexican food?  Then don’t pass up your opportunity to eat at Pepe’s, the northernmost Mexican restaurant in the world!


Attendees of the UAF sea ice course will probably get their chance to eat at Pepe’s, the northernmost Mexican restaurant in the world, and one of the better (of a surprising number of) restaurants in Barrow. Here Jody, Marcela, and Jesse enjoy some high latitude Mexican food after a hard day coring sea ice.

 

 


Metagenomic Assembly

In an earlier post I mentioned some odd microbiological observations that our group made during field work in Barrow, AK in 2010.  I also talked about how I’m hoping to repeat that observation this year, using the microscopy technique FISH.  In addition to collecting new data however, there is plenty of work left to be done on the 2010 samples.  I’m spending a lot of time right now working with two metagenomes derived from one young sea ice and one frost flower sample from the 2010 sample set.  A metagenome, as the name suggests, is a compilation of many different genomes.  Consider a liter of seawater.  It might contain around one billion bacteria, and therefore one billion bacterial genomes.  Although bacterial genomes can vary quite a bit even within a given bacterial “species”, for the sake of argument let’s say that those one billion bacteria comprise 1000 bacterial species, representing 1000 different genomes.

We’d like to know something about what the bacterial assemblage in our liter of water is doing.  Are they photosynthesizing?  Consuming high molecular weight organic compounds?  Living free in the water or attached to particles?  Clues to all these questions, indeed to the entire natural history of each species, can be found within their genomes.  It isn’t at all practical, however, to sequence all 1000 of these genomes (most belonging to species that you couldn’t even bring into pure culture in the lab without many years of work).  The solution?  Sequence all the DNA contained within the water, never mind which species it originally belonged to!

The resulting mess of sequence data is the metagenome, and allows for the least biased way of assessing the metabolic potential of a microbial assemblage.  To do this a researcher sifts through all the little bits of DNA, usually in the range of 50-100 bp, and assigns a putative function to the gene of origin based on similarity to a known gene in a database.  That’s great, but it would still be nice to know something about the metabolic potential of individual species in the assemblage.  This requires assembling the metagenome, something that was not possible until just a few years ago.  Given enough computer power, enough sequence data, and a low diversity assemblage (just a few species), researchers have been able to reconstruct entire microbial genomes from metagenomes.  One such research group is right here at the UW School of Oceanography, and recently published their metagenome derived genome (from an uncultured Euryarchaeota) in the journal Science.

With a lot of guidance from them I’ve established a pipeline for assembling my Barrow metagenomes.  I won’t get complete genomes, there isn’t nearly enough sequence data, but I might be able to create large enough contiguous sequence segments (contigs) to link some metabolic functions with specific bacteria in that environment.  You can check out my workflow here.  In brief I use a series of de Bruijn graph assemblies and read-to-contig alignments to gradually build bigger contigs.  For data reduction I use a combination of standard trimming tools and digital normalization.  The workflow page includes links to all of the tools I’m currently using.  The figure below shows the results from my first round assembly.  The x-axis is contig length in kmers (to convert to bp add 22).  The y-axis is coverage, or (roughly) the number of times that particular contig is seen in the assembly.  In the first round I’ve got contigs out to 10,000 bases, long enough to code for several genes.  Not bad!

Coverage vs. contig length (in kmers, bp = kmer + 22) following the first round assembly of PE Illumina reads using Velvet. The reads were trimmed and reduced by digital normalization prior to assembly.

 


Place it!

The last couple of posts on this blog have been about 16S gene sequences, and how microbial ecologists use these sequences to identify different bacteria and determine evolutionary relationships.  The primary method for the latter is to build phylogenetic trees.  A phylogenetic tree is basically a “family tree” of a bacterial lineage.  A whole field of statistics exists that describes different ways of inferring the phylogeny behind a tree, and evaluating the confidence in a given tree.

Like any other statistical method what you feed into a tree-building algorithm determines the confidence you can have in the final product.  For sequence data a big factor is sequence length.  A large number of long, high quality, well aligned sequences will invariably produce a tree with high confidence scores that probably represents evolutionary reality.  The problem is that these days sequence lengths are rarely long (or of particularly high quality).  For most analyses we favor next-generation sequence data that produces a lot of short sequence “reads”.  You can do a lot with these reads, but what you can’t do is align them to one another (very well), or produce any kind of reasonable tree.

Enter pplacer.  Pplacer is a great program produced by Erik Matsen’s group at the Fred Hutchinson Cancer Research Center (“the Hutch”).  You can read the original paper on pplacer here.  What pplacer does is map short reads to an existing, high quality reference tree (created in advance by the user).  This allows the quantitative taxonomic analysis of next-gen sequence data at a resolution surpassing what can be achieved with the standard classification tools (such as RDP).


A “fat” tree produced from pplacer. Several hundred thousand 454 reads were classified down to the order level; reads from the target order were placed on a reference tree constructed from near full-length sequences from the RDP. The wide red “edge” represents a large number of placements within the genus Blastobacter.

With pre-compiled executables (Mac and Linux only), pplacer is a cinch to get up and running.  It is, however, a rather finicky program.  Rarely has history documented an alignment parser that breaks on so many different characters.  Any punctuation in your sequence names, for example, will cause pplacer to turn up its nose at your hard-won sequence data and refuse to cooperate.

I used to maintain an elaborate series of shell scripts for prepping my query and reference sequences for pplacer, laboriously modifying them for each new data set.  I finally got tired of doing this and wrote a wrapper script in Python to clean my ref and query fasta files and do the initial alignment.  The script (PlaceIt) can be found here.  The wrapper relies on the Silva core set for alignment, using mothur, so it is only suitable for 16S analysis.  It would be easy to modify it to work with any other method of alignment (note that the Matsen group has a whole suite of scripts for working with pplacer; some of these are bound to be more suitable for many analyses).  If pplacer sounds useful for your work you should check it out!  And maybe PlaceIt can help you…


Going FISHing

***NOTE***

For this search to work you have to look for the reverse complement of each probe as well as the probe itself – metagenomic reads can come from either strand.  The script below searches for both orientations.

************

I’m preparing for some spring field work in Barrow, Alaska where I’ll be following up (hopefully) on some odd observations that we had there in 2010.  In particular I’d like to quantify some specific marine bacteria in Barrow sea ice.  The method that I’ll use to do this is called FISH, which stands for fluorescent in-situ hybridization.  FISH is a microscopy technique, so it differs from my normal sequence-based approach to evaluating community composition.

On the left, marine bacteria stained with DAPI, a non-specific stain that binds to DNA. On the right a specific clade is identified using FISH. The photo is courtesy of Matthew Cottrell; the original can be found at http://www.teachoceanscience.net/teaching_resources/education_modules/marine_bacteria/explore_trends/.

All we know about our mystery 2010 bacteria at this point is their classification based on partial 16S sequences.  Using this classification I’ve selected several FISH probes.  A FISH probe is a short (18-20 nucleotide) strand of DNA identical to some diagnostic region of a target clade’s 16S rRNA gene (a clade is an evolutionarily related group of organisms).  Since the actual 16S rRNA in the target ribosomes is complementary to the 16S rRNA gene, the probe – if it finds its way into a bacterial cell belonging to that clade – should stick to the ribosome.

Tagged on to one end of the probe is a molecule that fluoresces under green light.  Viewed in a microscope under green light bacteria that have taken up the probe, and in which the probe has adhered to the ribosome, will show up as tiny points of red light.  Assuming you know how much water you made the slide from, counting the tiny points of light tells you how many members of the target clade were present when the sample was taken.

FISH probes can have a broad or narrow specificity.  There are probes, for example, that tag virtually all Bacteria and Archaea (universal probes), just Bacteria or just Archaea (domain level probes), and down through the taxonomic rankings to probes that tag only specific bacterial “species”.  The probes that I’ll be using target approximately the family level (if your high school biology is too far back, the taxonomic rankings go domain, phylum, class, order, family, genus, species).

I’m fortunate to have two metagenomic datasets from our 2010 field season as this gives me the opportunity to test the probes before going into the field.  Without testing I don’t really know if they’ll target the bacteria we observed in 2010.  A metagenome is a collection of short DNA sequences from the most abundant genes in the environment, in this case 10-20 million sequence reads.  Among that massive collection of gene fragments are fragments belonging to 16S rRNA genes, and some fraction of those belong to the bacteria that I’d like to count this year.  With a little Python it’s a simple thing to search the metagenome for sequence fragments that match the probe:

import re

def rev_comp(seq):
    # reverse complement of a DNA sequence
    complement = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}
    return ''.join([complement[base] for base in seq[::-1]])

probes = ['TCGCTGCCCACTGTC', 'CCGACGGCTAACATTC', 'CTCCACTGTCCGCGACC']
found_42 = []
found_44 = []

for each in probes:
    n = 0
    h = 0
    # reads can come from either strand, so search for the probe and its reverse complement
    probe = re.compile(each + '|' + rev_comp(each))
    with open('overlapped_42.fasta','r') as fasta:
        for line in fasta:
            n = n + 1
            print n
            if re.search(probe, line) != None:
                h = h + 1
    found_42.append(h)
    # the same loop, run against overlapped_44.fasta, fills found_44

print found_42

The “with open” statement was new to me; the advantage to using it here is that the sequence files (5-10 Gb) are never held in memory – the lines are read one at a time.  The script isn’t exactly fast – it took about 2 hours to search all the sequences in both metagenomes for all three probes.  You might notice that it wastes a lot of time looking at the sequence id lines in the fasta files; eliminating that inefficiency could probably cut down on time substantially (see the sketch below).  The good news is that the number of probe hits matches really well with our expectations.  Over 2500 hits between the three probes for one of the metagenomes!  Had we used FISH in 2010 it is very likely that we would have seen an abundance of our target bacteria; hopefully we will this year…
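
A minimal tweak along those lines – skip any line that is a fasta header and only run the regular expression against the sequence lines (just the inner loop is shown, as a sketch):

with open('overlapped_42.fasta','r') as fasta:
    for line in fasta:
        if line.startswith('>'):
            continue  # skip the sequence id lines, only search the sequence itself
        if re.search(probe, line) != None:
            h = h + 1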

Sidenote: Bacteria vs. bacteria… It may appear that my capitalization is inconsistent for bacteria.  Bacteria can refer, as a proper noun, to the domain Bacteria.  It can also be used as a common noun for prokaryotic life in general, in which case it is simply bacteria.  Unfortunately some have advocated rather strongly for the word bacteria to replace the word prokaryote on the grounds that the latter term suggests a closer evolutionary relationship between the Archaea and Bacteria than actually exists.  Given the strong ecological overlap between the Bacteria and Archaea it is essential to have a term that refers to both as an ecological unit, and prokaryote, while imperfect, is much less misleading than bacteria!


A new model for the origin of life

Sort of.  Over the last three years I’ve  been fortunate to participate in a working group on the origin of life, coordinated by John Baross.  John is the only faculty member in the group, the rest are graduate students from various departments (Earth and Space Sciences, Atmospheric Sciences, Microbiology, and Oceanography) participating in the Astrobiology Program.  Three members of the original group are now postdoctoral researchers at other institutions (Princeton, Stanford, and the Carnegie Institute).  The working group began as a seminar series.  Each week (over pizza and beer – critical ingredients for the origin of life) a different student, or pair of students, would discuss some aspect of the origin of life as it related to their area of expertise.  After a couple of years, and with a lot of encouragement from John, we decided to formalize our thoughts on the subject in a review article, now in publication in Geobiology.

The writing effort was led by Eva Stueken, a doctoral student in Earth and Space Sciences who made the biggest conceptual leaps for the group (check out Eva’s other recent paper in Nature Geosciences on the sulfur cycle of the early Earth).  In the review article we try to develop the concept of early Earth as a “global reactor”.  Often research groups within the origin of life community develop expertise on a specific geological environment and the synthesis therein of some key chemical precursors to life.  This kind of specialization is necessary; without in-depth knowledge it is impossible to advance understanding.  The pitfall is that it’s too easy to focus on your area of expertise and try to make the whole origin of life happen there whether it be a warm little pond, a hydrothermal vent, or a glacier.

The problem with this approach is that the early Earth wasn’t composed of a single geological environment, and there is no reason to think that all of the chemistry and selection necessary to produce life had to happen in one place.  Chemically and physically the early Earth was an incredibly diverse and dynamic place, with a range of conditions partitioned in microenvironments.  Hydrothermal vents, marine sediment, seawater, beaches, ice, dust grains in the atmosphere, volcanic pumice, and a thousand other sites each hosted many microenvironments just as they do today.  Instead of one site (e.g. a black smoker in a hydrothermal vent field) hosting all the different reactions required to produce life, it seems more likely that different processes would occur in different places, wherever the conditions were most optimal for each process and the stability of the products.  These products would have had many opportunities to mix, mingle, be selected, and react further in the oceans, land, and atmosphere of the early Earth.


From Stueken et al 2013. It is reasonable to consider that the origin of life would have been the result of the synthesis, transport, and selection of chemical precursors across a range of environments. A select few environments are shown here, with associated mechanisms for transporting reaction products.

I started this entry with a note that this is not really a new model for the origin of life.  There is not yet any complete model to replace, and what we develop in the review is more of a scaffold than a model.  We stop short of noting the specific reactions that would need to occur in the global reactor and the environments in which they would be partitioned, necessary elements for a working model of the origin of life.  Hopefully other researchers with specific expertise in these reactions will be able to use the scaffold to place these reactions in context, as the model itself undergoes further refinement and modification.


RIP Carl Woese!

Sunday marked the passing of Carl Woese, one of the most important figures in microbial ecology in recent decades.  In fact that’s a bit of an understatement, his discoveries and the methods he pioneered have profoundly altered our understanding of the diversity of life on Earth.  Dr. Woese’s key innovation came at the dawn of the genomic era, when the sequencing of DNA was a laborious process carried out without automation.  In those days taxonomy, microbial and otherwise, was deduced from an organism’s physiology and morphology.  A new bacterium might be classified for example, based on cell shape, gram stain, and growth conditions.  Even in multicellular organisms morphology and growth conditions can be misleading – not everything that flies and breathes air is a bird.  In microscopic organisms this almost never works.  What scientists needed was a way to use the newly available gene sequences to deduce the evolutionary patterns underlying taxonomy.

For this to work however, they needed to find a slowly evolving gene that is present in all taxonomic groups.  Early work focused on the gene for a small structural RNA in the ribosome (ribosomal RNA, or rRNA) called 5S.  Ribosomes are present in all cells, and have been since the last universal common ancestor.  For various reasons ribosomes don’t evolve very quickly, meaning that the gene sequences coding for their structural components are somewhat similar between distantly related organisms – but they evolve enough that there are differences in the sequences from closely related organisms.  5S sequence comparison worked well enough to show that traditional microbial taxonomy was wrong, but due to the gene’s small size not well enough to resolve a true evolutionary history.

Woese focused on a larger rRNA gene called 16S (18S in eukaryotes), first comparing only a couple of 16S sequences (1972, 1974), then a few more (1975), and finally enough to see the big picture in 1977.  The basic method established by Woese to differentiate organisms based on 16S sequence similarity is still the gold standard for microbial taxonomy today, and the basis for much of the work in our lab.  The picture has become more complicated since 1977; we now understand that rRNA sequence similarity is only tangentially related to physiology and to what other genes are contained within an organism’s genome.  Since this is often the information we really want, knowing the rRNA sequence (and therefore the “species”) of an organism doesn’t always give us the answers we need.

Had Woese only honed the method of 16S rRNA sequence taxonomy it still would have been an impressive scientific work.  He is best known however, for the big picture that he observed in 1977 using 16S sequence comparison; that all life on Earth is distributed among three broad domains – the Bacteria, Archaea, and Eukarya.  Only the domain Eukarya contains multicellular life (and all that diversity is compressed into a tiny corner).  Under a microscope Bacteria and Archaea are indistinguishable.  Without rRNA sequence data we might never have realized that these groups are as different from one another as we are to them.


The classic “Woesian” phylogenetic tree showing the three domains of life. The more distant two branches are, the more distant the evolutionary relationship of the organisms on them (and the greater the difference in their 16S or 18S rRNA gene sequences). To give yourself a sense of scale consider how close we are to Zea (corn). I’ve had this tree on my hard drive for a while and I’m still trying to track down a citation. Will update when I find it.

Ultimately what Woese gave us were the tools for a lesson in humility.  We can now see that humans, mammals, vertebrates, even all animals and plants are a very minor component of life on Earth, and account for a shockingly small portion of life’s diversity.  To quote Woese (via the NY Times):

“It’s clear to me that if you wiped all multicellular life-forms off the face of the earth, microbial life might shift a tiny bit.  If microbial life were to disappear, that would be it — instant death for the planet.”

Well said!
