A little Europa here in Washington

I got an email the other day from a colleague who teaches science at Soap Lake High School, a rural high school in eastern Washington that, in my opinion, punches above its weight in the sciences.  One of his students is preparing a presentation on Europa, and she had some questions for a Europa expert.  I’m not a Europa expert, but I interact with people who are, so I suppose I’m only one degree removed from actual knowledge of Europa…

Soap Lake High School is located (predictably) near Soap Lake, a highly alkaline, highly stratified saline lake.  Because of the lake’s physical structure and chemistry it is a useful analogue for the Europan ocean.  It’s not Lake Vostok, but no analogy is perfect.  Europa is a pretty hot topic in astrobiology, and there is plenty of information out there on it, but I figured it wouldn’t hurt to post my answers to her questions here.  It also lets me expand on the original answers a little.

Why do scientists think that life could be on Europa?

Scientists think that life needs two things – liquid water and a source of energy.  Europa is one of the few places in our solar system (other than Earth) where both of these things are likely abundant.  There is a lot of evidence that underneath Europa’s ice shell is a huge ocean, much larger in volume than the oceans on Earth.  Because Europa is so close to Jupiter, and because of Jupiter’s massive size, there is a huge gravitational force exerted on Europa.  This gravitational force heats Europa much in the way that Earth is heated by radioactive decay in its interior, possibly enough to produce low-level volcanism at the sea floor.  As on Earth, seafloor volcanism produces springs, black smokers, and other sources of reduced elements, useful for life as an energy source (when paired with oxidized material).  On Earth these volcanic sites are hotspots for life.

For energy, life requires both oxidized and reduced material.  Oxidized material should be abundant at the surface of Europa’s ice shell, thanks to constant bombardment by radiation.  There is lots of evidence that Europa’s ice shell turns over gradually, bringing oxidized material into the ocean and exposing fresh ice to radiation.  On Earth sea ice is a hotspot for photosynthetic life because sunlight is abundant – like black smokers it is energy rich.  A similar situation could exist on Europa at the interface between ice and water, since this environment is optimally located between the oxidized ice and the reduced ocean.  If Europa were a battery the seafloor and ice surface would be the negative and positive terminals (respectively).  Life at the interface acts as a circuit, using the electrons for biological processes as they flow from the negative to the positive terminal (from seafloor-derived material to ice surface-derived material).


From Seeking Europa’s Ocean (Pappalardo, 2010). This cutaway shows many putative features of the Europan ice shell, including the strong temperature gradient, convective diapirs, and the surface features produced by these internal processes.

How is Soap Lake similar to Europa?  How is it different?

There is a lot of “chemoautotrophic” life in Soap Lake.  This is life that, rather than using the sun as a primary energy source (in the case of plants by creating reduced carbon and free oxygen), uses reduced and oxidized chemicals – the negative and positive battery terminals discussed before.  These chemicals may have a biological source (for example ammonia or nitrate or organic carbon) or a geological source (like iron).  If the chemical being used is not organic (does not contain carbon-hydrogen bonds), life is said to live lithoautotrophically – this kind of life can theoretically happen in the absolute absence of sunlight (no photosynthesis going on anywhere in the environment).  Finding locations on Earth where this is happening is challenging, because photosynthesis influences almost everything here.  On Europa there is essentially no light that can be used in photosynthesis, so we would expect a lot of lithoautotrophy.  Without photosynthesis the ecosystem of Soap Lake would look pretty different, but there are good examples of microbial life there that can serve as a model for life on Europa.  One example is bacteria that gain energy via sulfur oxidation, where reduced sulfur compounds are oxidized by oxygen.

If there is life on Europa, what do you think it looks like?

I think it would look pretty much like life on Earth – or at least microbial life from the right habitats on Earth.  Evolution seems to produce the same solutions to problems over and over again – this idea is called convergent evolution – and the problems life would face on Europa are not so different from those it faces on Earth.  It seems likely that evolution would produce the same basic structures and metabolic strategies to deal with these problems.

Do YOU think there’s life on Europa, and what will it mean if there is?

I think Europa is the most likely place in the solar system, other than Earth, for life to exist.  The fact that it is so far from Earth (unlike Mars) makes this even more exciting.  If we find life on Europa there is almost no chance that it came from Earth; it would have to have had a separate origin.  This would tell us a lot about how widespread life is in the universe.  Finding life on Mars would be a little different; chances are that it would have something to do with life on Earth (Earth and Mars exchange asteroids and other material all the time, providing a vector for biological contamination).

How soon do you think it will be before we know for sure if there’s life on Europa?

Unfortunately it will be a long time before we know if there is life on Europa – a couple of generations if we maintain an interest in exploring the outer solar system.  The closest look we’ve had at Europa came from the Galileo spacecraft, which was sent to study the entire Jupiter system.  It flew past Europa multiple times and from those observations we know a little about the composition of the moon and the likelihood of liquid water.  We’ve never had a spacecraft in orbit around Europa, however, let alone landed one on its surface!  Since life is most likely to exist below several miles of hard, thick ice, even getting to the surface won’t tell us if life exists.  Somehow we need to land on the surface and melt or drill through all that ice into the ocean below (warm ice rising in the shell might also contain life, but that would still involve a lot of drilling).  This is technically feasible but would be very, very expensive to do.  We might find tantalizing evidence of life from a spacecraft in orbit or on the surface, perhaps in the form of chemicals that were most likely produced by life, but this isn’t absolute proof.


Some things should be easy…

******************
Updated method posted here: https://www.polarmicrobes.org/?p=859
******************

But they’re not.  Getting meaningful taxonomy from blast results is one of those things.  Here’s the situation: I ran a blastn search locally against the NCBI nt (nucleotide) database, saving the output in the (somewhat) easy to manipulate xml format.  I’m interested in catching as many of the close homologues to my queries as possible, so I allowed up to 500 matches per search.  The resulting xml file is pretty substantial.  What I want to do now is identify the genus for each of my hits.  It would be absolutely fantastic if NCBI would include the taxonomic information in the blast results – it would be easy to do on their end – but they don’t.  C’est la vie.

Solution the first (the ugly)

Since the hit definitions (hit_def tag in the xml) contain the scientific name of the species of origin, this can be parsed with regular expressions to produce something that is probably, most of the time, the genus name.

import re

with open('big_blast.xml','r') as xml:
    for line in xml:
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = re.sub('<Hit_def>', '', line)   # remove the opening tag
            line = re.sub('</Hit_def>', '', line)  # remove the closing tag
            hit_def = line.lower()
            # the first word of the hit definition is (usually) the genus
            hit_genus = re.split(' ', hit_def)[0]

But this will on occasion produce some weirdness that is not the genus name, and I doubt the exact format of hit_def is stable over time.  It would be nice to tap into the NCBI taxonomy database to get this information right from the source.

Solution the second (the fat)

The NCBI taxonomy database can be acquired from the NCBI ftp site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/.  The large gi_taxid_nucl.dmp.gz file is the key to the NCBI taxonomy safe.  It is a simple tab delimited text file mapping a gi number to a taxonomy id number.  With the taxonomy id number (and a little voodoo) you can obtain a full taxonomy for any gi that has one.  Since the gi number is conveniently embedded in the hit id section of the xml (hit_id tag, pry it out with some more regular expressions) the full taxonomy seems so, so close…
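As a rough sketch of that step (assuming the hit_id takes its usual form, something like gi|297590412|gb|CP002047.1| – the number here is just a made-up example), the gi number can be pulled out like this:

import re

hit_id = 'gi|297590412|gb|CP002047.1|'   # a hypothetical hit_id pulled from the xml

match = re.search('gi\|(\d+)', hit_id)
if match != None:
    gi = match.group(1)
    print gi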

A casual user of Python might look at gi_taxid_nucl.dmp (with head, after using gunzip) and think dictionary.  If the whole thing can be crammed into a dictionary with the gi numbers as keys, the taxid can be easily obtained for any given gi number.  Problem is, it can’t be crammed into a dictionary – at least without a lot of trouble.  Python dictionary creation and operation is memory intensive.  24 gigs of RAM is not nearly enough for this.  I did manage to create a dictionary on a high memory machine (150 gigs, it failed when I initially set ulimit -v to 100 gigs) by iterating across each line in gi_taxid_nucl.dmp, saving the result as a pickle (a what you say?  That’s what I said too.  Turns out that pickles, in addition to being tasty fermented cucumbers, are binary flat files.  These files can store things like Python lists and dictionaries), but the resulting pickle couldn’t be utilized on the 24 gig machine.  I’m a guest on the high memory cluster and am reluctant to hog it for what seems like a simple operation, so I needed a different solution…
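For the record, the dictionary-and-pickle approach looks roughly like this (a sketch only – the pickle file name is my own choice, and as noted above you need an awful lot of RAM for it to finish):

import cPickle as pickle

gi_taxid = {}

# one gi (column 1) to taxid (column 2) pair per line
with open('gi_taxid_nucl.dmp', 'r') as dmp:
    for line in dmp:
        line = line.split()
        gi_taxid[line[0]] = line[1]

# save the dictionary as a binary flat file for later use
with open('gi_taxid_nucl.p', 'wb') as pickle_out:
    pickle.dump(gi_taxid, pickle_out)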

Solution the third (the slow)

I don’t have a lot of experience with SQL databases, but I know that this is the intended environment for the NCBI taxonomy.  The SQL workhorse MySQL looked like overkill, so I went with sqlite.  Python has a nice interface (the sqlite3 module) for working with sqlite databases, so this seemed like a natural way to go.  I created a database for gi_taxid_nucl.dmp like so:

import sqlite3
conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
c.execute('''CREATE TABLE gi_taxid
(gi, taxid)''')

with open('gi_taxid_nucl.dmp', 'r') as map_file:
	for line in map_file:
		line = line.split()
		print line[0], line[1]
		c.execute("INSERT INTO gi_taxid VALUES ("+line[0]+","+line[1]+")")

conn.commit()
conn.close()

The SQL database isn’t much larger than the original tab-delimited file, and didn’t take too much memory to build. It completed in a couple of hours. To get data back out of the database I tried:

import sqlite3
conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
t = (123,)
for row in c.execute('SELECT * FROM gi_taxid WHERE gi=?', t):
	print row

That should pull out the row corresponding to the gi number 123.  And it does.  But it takes a long time.  I didn’t time it, but I’m guessing it would take a few weeks to run tens of thousands of queries.  I’m not sure if there is a faster way to work with sqlite.  It feels like an SQL database should be the solution, but it hasn’t worked out yet…

Solution the fourth (the least ugly, fat, and slow)

The fastest way to parse gi_taxid_nucl.dmp (that I can come up with) is with grep.  It takes grep only a second or two to match a gi number to a line in gi_taxid_nucl.dmp, which is pretty amazing.  My final script is below.  I iterate across a list of gi numbers obtained from my blast output (reformatted into a table, see this post https://www.polarmicrobes.org/?p=753), using grep to match each to a line in gi_taxid_nucl.dmp.  From these lines I can create a nice, reasonably sized dictionary from which I can quickly obtain the taxid when I need it.  The part of the script that actually generates the taxonomy comes from this post: https://groups.google.com/forum/#!msg/qiime-forum/PXem7gMltao/oKOxJ_zFD4cJ.

from cogent.parse.ncbi_taxonomy import NcbiTaxonomyFromFiles
import subprocess

subprocess.call('rm blast_taxid.txt', shell = True)

print 'building dictionary'

n = 0
with open('blast_output_table.txt', 'r') as blast_table:
    for each in blast_table:
        n = n + 1
        print n
        each = each.split('\t')
        # each[3] holds the gi number; grep pulls the matching gi -> taxid line out of the dmp file
        subprocess.call('grep --max-count=1 \"'+each[3]+'\" gi_taxid_nucl.dmp | tee -a blast_taxid.txt', shell = True)

map_dict = {}           
with open('blast_taxid.txt') as gi_taxid:
    for each in gi_taxid:
        each = each.split()
        gi = each[0]
        taxid = each[1]
        map_dict[gi] = taxid
        print gi, map_dict[gi]

print 'reading in files'

tree = NcbiTaxonomyFromFiles(open('nodes.dmp'), open('names.dmp'))
root = tree.Root

print 'generating lineages'

# walk from a node up to the root of the taxonomy, recording names at the requested ranks
def get_lineage(node, my_ranks):
    ranks_lookup = dict([(r,idx) for idx,r in enumerate(my_ranks)])
    lineage = [None] * len(my_ranks)
    curr = node
    while curr.Parent is not None:
        if curr.Rank in ranks_lookup:
            lineage[ranks_lookup[curr.Rank]] = curr.Name
        curr = curr.Parent
    return lineage

ranks = ['genus']
output = open('blast_lineage.txt', 'w')
error = open('error.txt','w')
with open('blast_output_table.txt', 'r') as input:
    for line in input:
        if line[0:5] != 'query':
            try:
                line = line.split('\t')
                taxid = int(map_dict[line[3]])
                node = tree.ById[taxid]
                tax = get_lineage(node, ranks)
                tax = str(tax[0]).lower()
                print >> output, line[0]+'\t'+tax+'\t'+line[2]+'\t'+line[5],
                print line[0], tax.lower()
            except KeyError:
                # no taxid was found for this gi, note it in the error file
                print 'key error', line[3]
                print >> error, line[3]

output.close()
error.close()

I watched a show last night featuring researchers who get to dart polar bears from a helicopter, then land and give them medical examinations.  It looked like fun; I wonder if they want to switch dissertation topics…

 


Parsing blast xml output

In the previous post I described a method for efficiently blasting a medium-large dataset (33 million sequences).  One downside to using blast on a dataset this size is the very large size of the output xml file, in this case over 40 gigabytes.  The size of the output can be reduced by specifying the table output format (-outfmt 6) when executing blast, but this greatly reduces the amount of information you receive from the run.  If you are going to the trouble of running blast on 33 million sequences you may as well retain all the information possible from each alignment, reducing the chance of needing to blast the dataset again!  The xml format (-outfmt 5) is a (reasonably) user friendly and rich blast output format.  If you are new to bioinformatics, however, it can be a little daunting – the critical information (best hit, e value, etc.) can look pretty buried in a mysterious jumble of tabs and tags.  Here’s a chunk of the aforementioned xml output, corresponding to a single search:

  <BlastOutput_iterations>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147</Iteration_query-def>
      <Iteration_query-len>70</Iteration_query-len>
      <Iteration_hits>
      <Iteration_stat>
          <Statistics_db-num>17771239</Statistics_db-num>
          <Statistics_db-len>44780205193</Statistics_db-len>
          <Statistics_hsp-len>30</Statistics_hsp-len>
          <Statistics_eff-space>1769882720920</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
      <Iteration_message>No hits found</Iteration_message>

What we have here is some kind of nested format for storing information, identified by tags (<some_tag>), which is all that xml is.  Subordinate pieces of information are more indented than their parents, so in this example all the information belongs to BlastOutput_iterations.  If you are familiar with blast you will recognize some of the tags.  Iteration_query-def, for example, gives the name of the query (the cumbersome read id HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147).  The tag Iteration_message is self-explanatory; it tells us that this iteration returned no significant matches in the database.  Had there been a match, additional tags detailing the match (or matches) would have been included.

Fortunately this use of tags makes it pretty easy to parse blast xml output using regular expressions.  I wrote a short script to convert my massive xml output into a less massive (but still a little cumbersome) tab-delimited file from which I can extract specific lines corresponding to a hit using grep.  The basic strategy for parsing the xml is to create a variable for each tag of interest and march line by line through the file, re-saving the variable anytime you encounter the corresponding tag.  When the outermost (least subordinate) tag of interest is encountered (or the last in a known sequence of tags, as done here), print each variable to a line in an output file.

import re

output = open('test_blast.txt','w')
n = 0
print >> output, 'read'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e'

with open('blast.xml','r') as xml:
    for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            n = n + 1
            line = line.strip()
            line = re.sub('</?Iteration_query-def>', '', line)  # strip the opening and closing tags
            query_def = line
        if re.search('No hits found', line) != None:
            line = line.strip()
            line = re.sub('</?Iteration_message>', '', line)
            print n
            print >> output, query_def+'\t'+line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = re.sub('</?Hit_def>', '', line)
            hit_def = line
        if re.search('<Hit_accession>', line) != None:
            line = line.strip()
            line = re.sub('</?Hit_accession>', '', line)
            hit_acc = line
        if re.search('<Hsp_evalue>', line) != None:
            line = line.strip()
            line = re.sub('</?Hsp_evalue>', '', line)
            e_val = line
            print n
            print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val

output.close()

If, having successfully parsed the xml file, I want to access all the reads from my metagenome with a hit containing “nifH” in the definition, I can extract that information relatively quickly from my table using:

grep 'nifH' test_blast.txt

Since the information that grep returns will include the query id, a new fasta containing the reads with a blast hit of “nifH” is just a short script away (a sketch of that script follows the next command).  One tip… since most of the reads will not have a blast hit you can create a new, smaller table containing only those reads with hits by identifying lines with an e-value.

 grep 'e-' test_blast.txt > small_test_blast.txt
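And here’s a rough sketch of that short script for pulling out the nifH reads (not a polished tool – the file names are placeholders, and it assumes the grep results were saved with something like grep 'nifH' test_blast.txt > nifH_blast.txt):

# collect the query ids from the grep'd table
nifh_ids = set()
with open('nifH_blast.txt', 'r') as hits:
    for line in hits:
        nifh_ids.add(line.split('\t')[0])

# write out the corresponding reads from the original fasta
output = open('nifH_reads.fasta', 'w')
keep = False
with open('44_trimmed_combined_h20.fasta', 'r') as fasta:
    for line in fasta:
        if line.startswith('>'):
            # the read id in the table should match the fasta id line, minus the '>'
            keep = line.strip().split()[0].lstrip('>') in nifh_ids
        if keep == True:
            output.write(line)
output.close()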

Mega megablast – conducting a massively parallel blast

The blast algorithm is, with good reason, one of the most widely used tools in bioinformatics.  It allows a reasonably quick comparison between a set of sequences and a (potentially very large) database or set of sequences in a fasta file (slower than using a formatted database).  For very small datasets the NCBI maintains an online blast interface; for datasets with hundreds to thousands of sequences it is best to install the blast executable and databases locally.  A wide selection of pre-formatted databases are available on the NCBI ftp site.  But what about using blast for metagenomic analysis?  The relatively small metagenome that I’m working on has 33,000,000 sequence reads, after quality control.  Blast is fast considering the vast number of calculations required to compare all of these reads against even a modestly sized database, but it would still take over a week to run this analysis using blastn (the simplest flavor of blast) against the NCBI nt (nucleotide) database on a single computation node with 12 cpus.  There is a parallel version of blast that makes use of the MPI libraries to split the analysis across multiple nodes, but compiling parallel blast looks a little daunting (I haven’t tried yet).  Fortunately there is a better solution available for users of the UW’s Hyak cluster and similar systems.

On Hyak individual investigators, research groups, or academic units purchase computation nodes for three year deployments.  Because the computational needs of our group are limited – I’m currently the only one doing bioinformatics – we don’t own any nodes.  Fortunately the College of the Environment owns 11 nodes for trial use by investigators who might purchase nodes of their own.  They also allow these nodes to be used by students in research groups that don’t have a long term need for nodes, or for whom purchasing nodes isn’t economically feasible (thank you COE!).  This means that I have access to eleven 12-processor nodes.  As a freeloader on the system I try to tread lightly on these resources, and mostly succeed.  To blast my metagenome on this system I would need to tie up all the COE nodes for days which, in addition to being rude, would certainly jeopardize my continued use of Hyak.

For this kind of analysis the real power of Hyak comes in access to the backfill queue.  The backfill queue allows users to submit any number of jobs to any idle node in Hyak, whether they own the node or not.  This doesn’t help with memory intensive operations like assembly, but it works great on computationally intensive operations like blast.  The basic idea is to split my fasta file of 33,000,000 reads into smaller files that will blast in less than four hours (a requirement for use of the backfill queue).  A little testing determines that 330,000 reads will blast, using megablast against the NCBI nt database and all 12 cpus available on a single node, in about an hour and a half.  Great!  I split my fasta using (script can be found here):

python split_large_multifasta.py 44_trimmed_combined_h20.fasta 100

Then I use a second Python script to create a shell script for each file, and submit that shell script to the backfill queue:

import subprocess
# write one PBS submission script per split fasta file and send it to the backfill queue
for i in range(0,101):
	i = str(i)
	output = open(i+'.sh', 'w')
	print >> output, '#!/bin/bash'
	print >> output, '#PBS -N "'+i+'"'
	print >> output, '#PBS -d /gscratch/coenv/bowmanjs/barrow_metagenome_working/large_blast'
	print >> output, '#PBS -l nodes=1:ppn=12,feature=12core,walltime=4:00:00'
	print >> output, '#PBS -M bowmanjs@u.washington.edu'
	print >> output, '#PBS -m abe'
	print >> output, 'cp split_fasta/44_trimmed_combined_h20.fasta_'+i+'.fasta /gscratch/coenv/bowmanjs/databases/'
	print >> output, 'cd /gscratch/coenv/bowmanjs/databases'
	print >> output, 'blastn -task megablast -max_target_seqs 5 -evalue 1e-5 -db nt -outfmt 5 -out '+i+'.xml -query 44_trimmed_combined_h20.fasta_'+i+'.fasta -num_threads 12'
	print >> output, 'mv '+i+'.xml ../barrow_metagenome_working/large_blast/'
	print >> output, 'rm 44_trimmed_combined_h20.fasta_'+i+'.fasta'
	output.close()
	subprocess.call('chmod a+x '+i+'.sh', shell = True)
	subprocess.call('qsub -q bf '+i+'.sh', shell = True)

Notice that there’s a funny bit in the above script where the query fasta file is copied to the directory with the database, and then deleted after the analysis. This is because getting blast to look for the database in anything other than the working directory is quite difficult (one bad quirk of blast – supposedly you can fix it in your blast profile but this never works for me), so I find it easier just to work in the directory where I stash all my database files and then move everything out. This isn’t optimal but it works. At any rate executing the above script creates and submits 100 jobs to the backfill queue. There are some minor delays while the queuing system sorts itself out, and while waiting for idle nodes, but the whole thing is complete in about 4 hours. Combine all the xml outputs with:

cat *.xml > combined.xml

Now the bad part. At this point you have a behemoth xml file with all your data, 40+ gigabytes in my case. The bits you want can be pulled out with a new script which I’ll detail in a separate post.


A first look at the microbiology of frost flowers

We just published the first analysis of a microbial community inhabiting natural frost flowers in the journal Environmental Microbiology Reports.  Our results are a bit surprising, and I’ll get to them shortly.  First, a brief look at what we did.  Frost flowers have received quite a bit of media attention lately.  If you missed the coverage, frost flowers are delicate crystal structures that are nearly ubiquitous on the surface of newly formed sea ice (so long as the ice forms under a relatively cold atmosphere).  Salt, organic material, and bacteria from the source seawater are concentrated in frost flowers during the process of sea ice growth and maturation.

Although chemical and biological reactions are typically suppressed by low temperatures, the high concentrations of all these organic and inorganic materials, and the presence of ample energy (from sunlight), mean that some ecologically interesting things might be happening here.  For example, check out this paper by a group of atmospheric chemists connected with the OASiS (Ocean-Atmosphere-Sea Ice-Snow) project.  Among other things they report a high concentration of formaldehyde in frost flowers, probably the result of the photolysis (sun-driven breakdown) of larger organic molecules.

Our group is interested in what biological processes might be happening in frost flowers.  To develop testable hypotheses, however, we needed to make some initial observations about the system.  The most basic observation that a microbiologist typically wants to make is an assessment of community composition: what bacteria are present and their abundance relative to one another.  To do this we relied on an analysis of the 16S rRNA gene, a commonly used taxonomic marker gene for prokaryotes.  By comparing the 16S rRNA gene sequences in frost flowers with those of known bacteria in a database, we can approximate the composition of the frost flower microbial community.

The small lead from which frost flowers and young sea ice were sampled in April, 2010. Not the prettiest picture of frost flowers around, but representative of 90 % of the days in Barrow.

What we found really surprised us; in fact it took several months to wrap our heads around the results.  The heatmap below shows the relative abundance of bacterial genera in our analysis, as determined by one of the four 16S rRNA gene identification methods we used.  Black indicates bacterial genera below detection; the remainder are scaled from white to blue to red (most abundant).  The columns are our samples; there are four frost flower (FF) and four young sea ice (YI) samples.  Several microbial genera, in particular the Methylobacteria, Rhizobium, and Mesorhizobium, are enriched within frost flowers relative to the underlying young sea ice.  These genera are all members of the order Rhizobiales, thus we posit that the Rhizobiales are, in general, enriched in these frost flowers.

Relative abundance of genera by microarray analysis of the 16S rRNA gene (Bowman et al. 2013). Genera belonging to the order Rhizobiales tend to be enriched in frost flowers relative to young sea ice.

This is pretty weird.  The Rhizobiales are certainly not unheard of in marine waters, but they are virtually unreported in sea ice and we can’t find any report of them dominating a marine environment.  And they do truly dominate these frost flowers; in a more in-depth analysis of a single frost flower sample we found that 77 % of the 16S rRNA genes classified as Rhizobiales.  The environment where Rhizobiales do typically dominate is (ironically) on the roots of real flowers, or at least on legumes.  There they engage in a mutualistic relationship with the plant, providing fixed nitrogen in exchange for carbohydrates.

This fact might be a clue as to how Rhizobiales could come to dominate the sea ice surface.  Although there isn’t a lot of evidence for algae (aka phytoplankton trapped in the ice) in these frost flowers, there is for the underlying young ice.  Rhizobiales can be found in close associations with phytoplankton, just as they are with plants, so perhaps the Rhizobiales end up in ice because so many phytoplankton do.  Once in the ice, the now stressed phytoplankton and bacteria might end their relationship (enough stress will test even the closest of relationships), leaving the Rhizobiales free to be transported to the ice surface during brine rejection (a key element of sea ice growth).

Of course this is just a hypothesis, but hypothesis development was our goal with this study.  We are left in the best possible situation, with an interesting observation and just enough data to develop further questions (okay, the best possible situation would be a definitive answer and a paper in Nature, but still…).  To refine those questions we are continuing to work with these samples, in particular with metagenomes obtained from one of our frost flower and one of our young ice samples.

 


The UAF sea ice course returns!

Another great opportunity was announced at UAF, which seems to be full of them these days.  This is the return of the UAF sea ice course, last taught in 2008.  Here’s the announcement:

*******************************

This spring (May 8-18),  UAF is once again offering a 2cr sea ice field course in Barrow, Alaska (last given in 2008). Instructors will include Hajo Eicken, Rolf Gradinger, Don Perovich, Andy Mahoney, Eric Collins and others. This is a unique class for all students with interest in physical, chemical and biological oceanography as well as marine biology. The course also includes a module on traditional and local knowledge of sea ice. Please pass this on and encourage potential students to apply –

more details are available on the course website:

http://www.sfos.uaf.edu/research/seaicebiota/barrowcourse_2013/

******************************

The sea ice course is an excellent opportunity for any advanced undergraduate or graduate student (or postdoc, or non-degree seeking interested party for that matter) to get some hands-on (and feet-on) experience with sea ice, under the tutelage of some of the world’s most experienced researchers.  The instructor lineup is pretty impressive; Don Perovich, Hajo Eicken, and Rolf Gradinger have been in this business a long time and have contributed very fundamental knowledge.  Eric Collins and Andy Mahoney represent the new generation of researchers, and both have made impressive contributions studying sea ice under challenging winter conditions.  I didn’t start graduate school until after the last sea ice course, so I learned/am learning how to work on and with sea ice by trial and error.  My guess is that many hours and frozen fingers could probably be avoided with this kind of formal training!

Don’t like the cold?  Not to worry… May is balmy in Barrow, and there should be 24 hours of daylight by the time the course starts.  Think spring skiing conditions.  Like Mexican food?  Then don’t pass up your opportunity to eat at Pepe’s, the northernmost Mexican restaurant in the world!


Attendees of the UAF sea ice course will probably get their chance to eat at Pepe’s, the northernmost Mexican restaurant in the world, and one of the better (of a surprising number of) restaurants in Barrow. Here Jody, Marcela, and Jesse enjoy some high latitude Mexican food after a hard day coring sea ice.

 

 


Metagenomic Assembly

In an earlier post I mentioned some odd microbiological observations that our group made during field work in Barrow, AK in 2010.  I also talked about how I’m hoping to repeat that observation this year, using the microscopy technique FISH.  In addition to collecting new data, however, there is plenty of work left to be done on the 2010 samples.  I’m spending a lot of time right now working with two metagenomes derived from one young sea ice and one frost flower sample from the 2010 sample set.  A metagenome, as the name suggests, is a compilation of many different genomes.  Consider a liter of seawater.  It might contain around one billion bacteria, and therefore one billion bacterial genomes.  Although bacterial genomes can vary quite a bit even within a given bacterial “species”, for the sake of argument let’s say that those one billion bacteria comprise 1000 bacterial species, representing 1000 different genomes.

We’d like to know something about what the bacterial assemblage in our liter of water is doing.  Are they photosynthesizing?  Consuming high molecular weight organic compounds?  Living free in the water or attached to particles?  Clues to all these questions, indeed to the entire natural history of each species, can be found within their genomes.  It isn’t at all practical however, to sequence all 1000 of these genomes (most belonging to species that you couldn’t even bring into pure culture in the lab without many years of work).  The solution?  Sequence all the DNA contained within the water, never mind which species it originally belonged to!

The resulting mess of sequence data is the metagenome, and it allows for the least biased way of assessing the metabolic potential of a microbial assemblage.  To do this a researcher sifts through all the little bits of DNA, usually in the range of 50-100 bp, and assigns a putative function to the gene of origin based on similarity to a known gene in a database.  That’s great, but it would still be nice to know something about the metabolic potential of individual species in the assemblage.  This requires assembling the metagenome, something that was not possible until just a few years ago.  Given enough computer power, enough sequence data, and a low diversity assemblage (just a few species), researchers have been able to reconstruct entire microbial genomes from metagenomes.  One such research group is right here at the UW School of Oceanography, and recently published their metagenome-derived genome (from an uncultured Euryarchaeota) in the journal Science.

With a lot of guidance from them I’ve established a pipeline for assembling my Barrow metagenomes.  I won’t get complete genomes, there isn’t nearly enough sequence data, but I might be able to create large enough contiguous sequence segments (contigs) to link some metabolic functions with specific bacteria in that environment.  You can check out my workflow here.  In brief I use a series of de Bruijn graph assemblies and read-to-contig alignments to gradually build bigger contigs.  For data reduction I use a combination of standard trimming tools and digital normalization.  The workflow page includes links to all of the tools I’m currently using.  The figure below shows the results from my first round assembly.  The x-axis is contig length in kmer (to convert to bp add 22).  The y-axis is coverage, or (roughly) the number of times that particular contig is seen in the assembly.  In the first round I’ve got contigs out to 10,000 bases, long enough to code for several genes.  Not bad!

Coverage vs. contig length (in kmer, bp = kmer+22) following the first round assembly of PE Illumina reads using Velvet. The reads were trimmed and reduced by digital normalization prior to assembly.
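Just to give a flavor of one of those assembly rounds (a sketch only, not my actual workflow – the file names and parameters here are stand-ins), running Velvet with a k-mer length of 23 (hence bp = kmer + 22) looks something like:

import subprocess

# hash the (trimmed, digitally normalized) reads with a k-mer length of 23
subprocess.call('velveth round1_k23 23 -fastq -shortPaired reads_diginorm.fastq', shell = True)

# run the assembly, letting Velvet estimate expected coverage and the coverage cutoff
subprocess.call('velvetg round1_k23 -exp_cov auto -cov_cutoff auto', shell = True)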

 


Place it!

The last couple of posts on this blog have been about 16S gene sequences, and how microbial ecologists use these sequences to identify different bacteria and determine evolutionary relationships.  The primary method for the latter is to build phylogenetic trees.  A phylogenetic tree is basically a “family tree” of a bacterial lineage.  A whole field of statistics exists that describes different ways of inferring the phylogeny behind a tree, and evaluating the confidence in a given tree.

Like any other statistical method, what you feed into a tree-building algorithm determines the confidence you can have in the final product.  For sequence data a big factor is sequence length.  A large number of long, high quality, well aligned sequences will invariably produce a tree with high confidence scores that probably represents evolutionary reality.  The problem is that these days sequence lengths are rarely long (or of particularly high quality).  For most analyses we favor next-generation sequence data that produces a lot of short sequence “reads”.  You can do a lot with these reads, but what you can’t do is align them to one another (very well), or produce any kind of reasonable tree.

Enter pplacer.  Pplacer is a great program produced by Erik Matsen’s group at the Fred Hutchinson Cancer Research Center (“the Hutch”).  You can read the original paper on pplacer here.  What pplacer does is map short reads to an existing, high quality reference tree (created in advance by the user).  This allows the quantitative taxonomic analysis of next-gen sequence data at a resolution surpassing what can be achieved with the standard classification tools (such as RDP).


A “fat” tree produced from pplacer. Several hundred thousand 454 reads were classified down to the order level, and reads from the target order were placed on a reference tree constructed from near full-length sequences from the RDP. The wide red “edge” represents a large number of placements within the genus Blastobacter.

Pplacer’s a great program, and with pre-compiled executables (Mac and Linux only) it’s a cinch to get up and running.  It is, however, a rather finicky program.  Rarely has history documented an alignment parser that breaks on so many different characters.  Any punctuation in your sequence names, for example, will cause pplacer to turn up its nose at your hard-won sequence data and refuse to cooperate.

I used to maintain an elaborate series of shell scripts for prepping my query and reference sequences for pplacer, laboriously modifying them for each new data set.  I finally got tired of doing this and wrote a wrapper script in Python to clean my ref and query fasta files and do the initial alignment.  The script (PlaceIt) can be found here.  The wrapper relies on the Silva core set for alignment, using mothur, so it is only suitable for 16S analysis.  It would be easy to modify it to work with any other method of alignment (note that the Matsen group has a whole suite of scripts for working with pplacer; some of these are bound to be more suitable for many analyses).  If pplacer sounds useful for your work you should check it out!  And maybe PlaceIt can help you…


Going FISHing

***NOTE***

For the probe searches in this article to work you must use the reverse complement of the probe, not the probe itself.  The script below reverse complements each probe before searching.

************

I’m preparing for some spring field work in Barrow, Alaska where I’ll be following up (hopefully) on some odd observations that we had there in 2010.  In particular I’d like to quantify some specific marine bacteria in Barrow sea ice.  The method that I’ll use to do this is called FISH, which stands for fluorescent in-situ hybridization.  FISH is a microscopy technique, so it differs from my normal sequence-based approach to evaluating community composition.

On the left, marine bacteria stained with DAPI, a non-specific stain that binds to DNA. On the right one specific clade is identified using FISH. The image is courtesy of Matthew Cottrell; the original can be found at http://www.teachoceanscience.net/teaching_resources/education_modules/marine_bacteria/explore_trends/.

All we know about our mystery 2010 bacteria at this point is their classification based on partial 16S sequences.  Using this classification I’ve selected several FISH probes.  A FISH probe is a short (18-20 nucleotide) strand of DNA identical to some diagnostic region of a target clade’s 16S rRNA gene (a clade is an evolutionarily related group of organisms).  Since the actual 16S rRNA in the target ribosomes is complementary to the 16S rRNA gene, the probe – if it finds its way into a bacterial cell belonging to that clade – should stick to the ribosome.

Tagged on to one end of the probe is a molecule that fluoresces under green light.  Viewed in a microscope under green light, bacteria that have taken up the probe, and in which the probe has adhered to the ribosome, will show up as tiny points of red light.  Assuming you know how much water you made the slide from, counting the tiny points of light tells you how many members of the target clade were present when the sample was taken.

FISH probes can have a broad or narrow specificity.  There are probes, for example, that tag virtually all Bacteria and Archaea (universal probes), just Bacteria or just Archaea (domain level probes), and down through the taxonomic rankings to probes that tag only specific bacterial “species”.  The probes that I’ll be using target at approximately the family level (if your high school biology is too far back, the taxonomic rankings go domain, phylum, class, order, family, genus, species).

I’m fortunate to have two metagenomic datasets from our 2010 field season, as this gives me the opportunity to test the probes before going into the field.  Without testing I don’t really know if they’ll target the bacteria we observed in 2010.  A metagenome is a collection of short DNA sequences from the most abundant genes in the environment, in this case 10-20 million sequence reads.  Among that massive collection of gene fragments are fragments belonging to 16S rRNA genes, and some fraction of those belong to the bacteria that I’d like to count this year.  With a little Python it’s a simple thing to search the metagenome for sequence fragments that match the probe (or rather its reverse complement – see the note above):

import re

# FISH probe sequences; the metagenome is searched for the reverse complement
# of each probe (see the note at the top of this post)
probes = ['TCGCTGCCCACTGTC', 'CCGACGGCTAACATTC', 'CTCCACTGTCCGCGACC']

def revcomp(seq):
    complement = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}
    return ''.join([complement[base] for base in reversed(seq)])

found_42 = []
found_44 = []    # for the second metagenome, which gets searched the same way

for each in probes:
    n = 0
    h = 0
    probe = re.compile(revcomp(each))

    with open('overlapped_42.fasta','r') as fasta:
        for line in fasta:
            n = n + 1
            print n
            if re.search(probe, line) != None:
                h = h + 1
    found_42.append(h)

print found_42

The “with open” statement was new to me; the advantage to using it here is that the sequence files (5-10 Gb) are not held in memory.  The script isn’t exactly fast; it took about 2 hours to search all the sequences in both metagenomes for all three probes.  You might notice that it wastes a lot of time looking at the sequence id lines in the fasta files; eliminating that inefficiency could probably cut down on time substantially.  The good news is that the number of probe hits matches really well with our expectations.  Over 2500 hits between the three probes for one of the metagenomes!  Had we used FISH in 2010 it is very likely that we would have seen an abundance of our target bacteria; hopefully we will this year…
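One quick way to eliminate that inefficiency is to skip any line that starts with the fasta “>” character before applying the regular expression.  A minimal sketch, for a single probe (untested, and the variable names are just placeholders):

import re

complement = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}
probe_seq = 'TCGCTGCCCACTGTC'    # the first probe from the list above
search_seq = ''.join([complement[base] for base in reversed(probe_seq)])    # reverse complement
probe = re.compile(search_seq)

h = 0
with open('overlapped_42.fasta','r') as fasta:
    for line in fasta:
        if line.startswith('>'):    # sequence id lines can't contain the probe, skip them
            continue
        if re.search(probe, line) != None:
            h = h + 1

print h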

Sidenote: Bacteria vs. bacteria… It may appear that my capitalization is inconsistent for bacteria.  Bacteria can refer, as a proper noun, to the domain Bacteria.  It can also be used as a common noun for prokaryotic life in general, in which case it is simply bacteria.  Unfortunately some have advocated rather strongly for the word bacteria to replace the word prokaryote on the grounds that the latter term suggests a closer evolutionary relationship between the Archaea and Bacteria than actually exists.  Given the strong ecological overlap between the Bacteria and Archaea it is essential to have a term that refers to both as an ecological unit, and prokaryote, while imperfect, is much less misleading than bacteria!


A new model for the origin of life

Sort of.  Over the last three years I’ve been fortunate to participate in a working group on the origin of life, coordinated by John Baross.  John is the only faculty member in the group; the rest are graduate students from various departments (Earth and Space Sciences, Atmospheric Sciences, Microbiology, and Oceanography) participating in the Astrobiology Program.  Three members of the original group are now postdoctoral researchers at other institutions (Princeton, Stanford, and the Carnegie Institution).  The working group began as a seminar series.  Each week (over pizza and beer – critical ingredients for the origin of life) a different student, or pair of students, would discuss some aspect of the origin of life as it related to their area of expertise.  After a couple of years, and with a lot of encouragement from John, we decided to formalize our thoughts on the subject in a review article, now in publication in Geobiology.

The writing effort was led by Eva Stueken, a doctoral student in Earth and Space Sciences who made the biggest conceptual leaps for the group (check out Eva’s other recent paper in Nature Geoscience on the sulfur cycle of the early Earth).  In the review article we try to develop the concept of early Earth as a “global reactor”.  Often research groups within the origin of life community develop expertise on a specific geological environment and the synthesis therein of some key chemical precursors to life.  This kind of specialization is necessary; without in-depth knowledge it is impossible to advance understanding.  The pitfall is that it’s too easy to focus on your area of expertise and try to make the whole origin of life happen there, whether it be a warm little pond, a hydrothermal vent, or a glacier.

The problem with this approach is that the early Earth wasn’t composed of a single geological environment, and there is no reason to think that all of the chemistry and selection necessary to produce life had to happen in one place.  Chemically and physically the early Earth was an incredibly diverse and dynamic place, with a range of conditions partitioned in microenvironments.  Hydrothermal vents, marine sediment, seawater, beaches, ice, dust grains in the atmosphere, volcanic pumice, and a thousand other sites each hosted many microenvironments, just as they do today.  Instead of one site (e.g. a black smoker in a hydrothermal vent field) hosting all the different reactions required to produce life, it seems more likely that different processes would occur in different places, wherever the conditions were most optimal for each process and the stability of the products.  These products would have had many opportunities to mix, mingle, be selected, and react further in the oceans, land, and atmosphere of the early Earth.


From Stueken et al. 2013. It is reasonable to consider that the origin of life would have been the result of the synthesis, transport, and selection of chemical precursors across a range of environments. A select few environments are shown here, with associated mechanisms for transporting reaction products.

I started this entry with a note that this is not really a new model for the origin of life. There is not yet any complete model to replace, and what we develop in the review is more of a scaffold than a model.  We stop short of noting the specific reactions that would need to occur in the global reactor and the environments in which they would be partitioned, necessary elements for a working model of the origin of life.  Hopefully other researchers with specific expertise in these reactions will be able to use the scaffold to place these reactions in context, as the model itself undergoes further refinement and modification.
