Android, Mendeley, Referey, and taming the reference beast

Like everyone else in academia I'm in a constant struggle against information overload.  This comes not only from the overwhelming amount of data from the analyses I'm working on, but from an ever-growing stream of scientific literature.  I gave up trying to stay on top of journal tables-of-contents a long time ago; instead I rely on Google Scholar alerts to notify me when relevant publications come out.  Unfortunately this switch didn't reduce the volume of literature that catches my eye, it merely ensured that all of it would be relevant, interesting, and even more important to file away somewhere and remember.

Some people seem to have a great mental filing system for authors, dates, and subjects of key articles, and can pull a critical reference from the depths of their hard drive based on title alone.  I'm not one of those people.  If I can't find a reference by a keyword search it doesn't exist in my universe.  Fortunately, in front of a laptop it is generally possible to find what you want by keyword search.  Both Windows Search and OSX Spotlight automatically index the contents of pdfs, making it easy to find all the papers in a personal library pertaining to, say, Wigglesworthia.  To make this even easier, about a year ago I started using Mendeley Desktop, a pretty decent program for organizing and annotating pdfs.  I don't use Mendeley quite the way it's meant to be used, however.  The free version of the program comes with a generous chunk of online storage space, which allows you to maintain an online copy of your library that's accessible from anywhere.  My pdf library is pretty large, however, and I'm hesitant to upgrade to the paid version of Mendeley – which comes with more storage space – until I'm really sold on the program (which is happening quickly, aided by the hope that Mendeley's cite-while-you-write add-on will forever remove from my life the horrible, evil travesty which is Endnote…).

Enter the Android tablet that I recently acquired.  My vision is to be able to sit in a class/seminar/meeting, with that critical paper/fact/reference on the tip of my tongue, and actually be able to LOOK IT UP.  Should be simple enough, right?  Well, maybe.  An additional requirement is that I need to be able to look things up when I'm at sea, in the field, or otherwise not connected to the internet.  This makes it a little trickier.  After reading a number of blog posts from people with similar problems I was able to patch together a solution that works (so far) remarkably well.  The limitations of Android and Apple are equal in this respect, so this solution should be relevant for iPad users too.

My initial thought was to migrate my pdf library to Dropbox, and use the Android Dropbox app and native search function to find what I wanted.  Two problems: 1) the Dropbox app doesn't actually sync files to the tablet (crazy!) and 2) there IS NO native search function on either Android or Apple iOS that searches file contents (crazier still!).  Fortunately I came across Dropsync, a competent little app that picks up where Dropbox falls flat.  With the paid version you can keep select Dropbox folders synced to the SD card in your Droid.  Problem one solved.

If I didn’t need the search function to work without internet I could have downloaded one of a couple of Mendeley-compatible apps, upgraded to the paid version of Mendeley, and lived the life of online library access (ipad owners can cut the middle man and make use of an official Mendely app).  Needing an offline search feature I was pretty well stuck until I found out about Referey (thanks Mendeley, for posting links to third-party apps that help your users…).  Referey was designed with my exact problem in mind.  Here’s how it works:

Having migrated your entire reference library to a folder in Dropbox, you create a symlink in that folder to the sqlite database where Mendeley keeps the index for your library.  On Windows 7 I found the database in C:\Users\Jeff\AppData\Local\Mendeley Ltd\Mendeley Desktop\largedatabasewithyourmendeleyusername.sqlite.  From the destination folder in Dropbox I created a symlink as such:

MKlink largedatabasewithyourmendeleyusername.sqlite "C:\Users\Jeff\AppData\Local\Mendeley Ltd\Mendeley Desktop\largedatabasewithyourmendeleyusername.sqlite"

On a Mac or Linux machine the equivalent command is ln -s (with the target first and the link name second).  That keeps both my reference library and the Mendeley index up to date on my tablet.  On your tablet, in preferences, you tell Referey where in your Dropbox (synced with Dropsync, not Dropbox) you've stashed the database and your library.  Referey does no further indexing; it simply uses what Mendeley already knows to find what you want in the library.  Pdfs returned in a search can be opened and annotated using whatever app you want (I'm currently using iAnnotate PDF).

For a while I really doubted whether a solution existed for what seems like an obvious problem.  I was pretty happy to find one…

Posted in Research | 3 Comments

Frost flower metagenome paper submitted

I just hit the “submit” button for our latest frost flower paper, which reviewer-be-willing will appear in an upcoming polar and alpine special issue of FEMS Microbiology Ecology.  Several scattered bits of analysis from the paper have appeared in this blog; here's a recap of what we did and why.

In a 2013 paper we described an unusual community of bacteria in Arctic frost flowers composed primarily of members of the bacterial order Rhizobiales.  This was a bit odd because Rhizobiales, while not unknown in the marine environment, are not typically found in high abundance there.  Rhizobiales have definitely not been observed in association with sea ice.  To try and figure out what was going on with this particular community we went back to the same samples (originally collected in April 2010) and used whole-genome amplification to amplify all the DNA until we had enough to sequence metagenomes from one frost flower and one young sea ice sample.  The good people at Argonne National Lab did the actual sequencing (great work, great prices…).

Following our failed effort in February of 2010 we received an email in April that natural frost flowers were forming.  I went up the next day and sampled from this lead; these are the samples from our 2013 paper (Bowman et al., EMIR, 2013).  That night it snowed a foot, eradicating all frost flowers.  If I'd been a day later I would have missed them.

The field of frost flowers yielding the samples for this paper, and for Bowman et al. 2013.

I'd never worked with metagenomic data before, and once I had the actual sequence reads it took me a while to figure out how to get the most information out of them.  For a first cut I threw them into MG-RAST.  This confirmed the Rhizobiales dominance easily enough, but for publication-quality results you really need to do the analysis yourself.  Ultimately I used a four-pronged approach to evaluate the metagenomes.

MG-RAST, great for a first cut of your data, and for wowing your advisor with snazzy graphics, but no good for rigorous analysis.  This figure from MG-RAST shows the recruitment of metagenomic sequence reads to the genome of Sinorhizobium meliloti 1021.

1.  Extract 16S rRNA reads and analyze taxonomy.  For this I use mothur, treating the 16S-associated reads like any 16S amplicon dataset (except that you can't cluster them!).

2.  Align all the reads against all available completed genomes.  We’ve got 6,000 or so genomes in Genbank, may as well do something with them.  This actually provides a ton of information, and backs up the 16S based taxonomy.

3.  Assemble.  I'm not that good at this, and the final contigs weren't that long.  Cool though.

4.  Targeted annotation.  One would love to blastx all the reads against the nr database, or even the RefSeq protein database, but this is silly.  It would tie up thousands of cpus for days, and that computing power is better spent elsewhere.  If you have some idea what you're looking for, however, you can blastx against small protein databases pretty easily (see the sketch below).  After the blast search you can use pplacer or a similar tool to conduct a more fine-scale analysis of the reads with positive matches.
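
A rough sketch of what step 4 looks like, in the same Python-wrapping-shell style used elsewhere on this blog.  The file names, including the small database of DMS utilization proteins, are placeholders rather than the actual files used for the paper:

import subprocess

## build a small, targeted protein database (placeholder file names)
makedb = subprocess.Popen('makeblastdb -in dms_proteins.fasta -dbtype prot -out dms_db', shell = True)
makedb.communicate()

## blastx the metagenomic reads against it; reads with hits can then go to pplacer
blast = subprocess.Popen('blastx -query reads.fasta -db dms_db -outfmt 6 '
                         '-evalue 1e-5 -num_threads 8 -out dms_hits.txt', shell = True)
blast.communicate()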

So what’d we find?  The frost flower metagenome contains lots of things that look like Rhizobiales, but that are pretty different from most of the Rhizobiales with sequenced genomes in Genbank.  Among other things they don’t have nodulation genes or genes for nitrogen fixation, unlike many terrestrial Rhizobiales.  The community does have genes for some interesting processes however, including the utilization of dimethylsulfide (DMS) – a climatically active biogenic gas, the degradation of glycine betaine – which may be an important compound in the nitrogen cycle, and haloperoxidases – oxidative stress enzymes that release climatically active methylhalides.

You can find more details on some of this analysis on this poster.

Posted in Research | 1 Comment

Maintaining an updated record of Genbank genomes

For an ongoing project I need a local copy of all the prokaryotic genomes in Genbank.  New genomes are being added at an ever-increasing rate, making it difficult to keep up by manually downloading them from the FTP site.  I finally decided to write a script that will automatically keep my local directory up to date.  I'm only interested in the ffn files, but it would be easy to adapt the following script for any other file type.  As always, suggestions for improving the script are welcome.  A word of warning… I've tested all the components of the script individually, but due to long download times I haven't tested the whole thing as a single entity.

First, the necessary modules:

import subprocess
import os
from ftplib import FTP
import sys
from cStringIO import StringIO

I want the ffn files from both the complete and draft Genbank FTP directories.  Working first with complete:

bacteria = subprocess.Popen('wget -r -A *.ffn -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/', shell = True)
bacteria.communicate()

That should have created a directory tree mimicking that found on the FTP server, but containing only files ending in ffn.  Each genome directory likely contains multiple files, one for each element of the genome.  I’d like to combine them all together so that I can analyze the genome as a unit.

have = os.listdir('combined_ffn')
for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria'):
    if d+'.combined.ffn' not in have:
        subprocess.call('cat ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/'+d+'/*.ffn > combined_ffn/'+d+'.combined.ffn', shell = True)

Now moving on to the draft genomes.  This is a little trickier since genome directories will be removed from Bacteria_DRAFT once assembly and annotation are complete.  My local directory needs to reflect these changes.  To figure out what should be in the draft directory, and remove any local directories that should no longer be there:

old_stdout = sys.stdout
sys.stdout = mystdout = StringIO()

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('genbank/genomes/Bacteria_DRAFT/')
ftp.dir()
ftp.quit()

sys.stdout = old_stdout
with open('temp.txt', 'w') as good_out:
    print >> good_out, mystdout.getvalue()

for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    if d not in mystdout.getvalue():
        subprocess.call('rm -r ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d, shell = True)

Now update the local directory tree.

d_bacteria = subprocess.Popen('wget -r -A *scaffold.ffn.tgz -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/', shell = True)
d_bacteria.communicate()

Another hang-up with the draft genomes is that the ffn files are all present as gzipped tarballs.  They need to be untarred before concatenation.

untar = subprocess.Popen('for d in ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/*;do cd $d;ls *.tgz | parallel "tar -xzvf {.}.tgz";rm *.tgz;cd /volumes/deming/databases/genomes_ffn;done', shell = True)
untar.communicate()

And now we’re on the home stretch…

for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    cont = False
    contents = os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d)
    for f in contents:
        if 'ffn' in f:
            cont = True
    if d+'.draft.combined.ffn' in have:
        cont = False
    if cont == True:
        subprocess.call('cat ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d+'/*.ffn > combined_ffn/'+d+'.draft.combined.ffn', shell = True)

I tried to write this script in both pure bash and pure Python, and failed in both cases.  A more experienced scripter would probably have little trouble with either, but my version does illustrate how well bash and Python play together.  For the first run I executed the script manually.  It'll probably take a couple of days to download on our building's 1 Mbps connection (come on UW, it's 2013…).  Once things are running smoothly I'll place a call to the script in /private/etc/periodic/weekly and have an automatically updated database!
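
A minimal sketch of what that weekly call might look like: a small executable wrapper dropped into /private/etc/periodic/weekly.  The script name update_genomes.py is an assumption; substitute whatever you've named the script above and wherever it lives.

#!/usr/bin/env python

## hypothetical wrapper, e.g. /private/etc/periodic/weekly/500.update-genomes
## move to the database directory, then run the update script

import os
import subprocess

os.chdir('/volumes/deming/databases/genomes_ffn')
update = subprocess.Popen('python update_genomes.py', shell = True)
update.communicate()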

 

Posted in Research | Leave a comment

Phylogenetic Placement Revisited

UW Oceanography has a student-led bioinformatics seminar that meets every spring and fall (or at least has since the fall of 2012 – we hope to continue it).  Each quarter is a little different; this time around each seminar attendee will present on a tool or set of tools that they've applied in their research.  It's a good way for us to keep up on the ever-changing slate of available tools for sequence analysis, and to delve a little deeper into the ones we currently use.  This week it's my turn, and I've prepared a short tutorial on pplacer, a common phylogenetic placement program.  I wrote a post about phylogenetic placement a few months ago.

There is some unpublished data posted on the seminar website, so it unfortunately isn’t public.  To make the tutorial available I’ve copied it here.  What was great about putting this tutorial together was the opportunity to try papara, an alignment program developed by the same group responsible for RAxML.  Check out Scenario 2 in the tutorial for details on papara.

**************************************************************************

This is a tutorial for using pplacer with 16S rRNA genes under three separate scenarios:

1.  Reference and reference-query alignment using mothur, tree building using RAxML
2.  Reference alignment using mothur, reference-query alignment using papara, tree building using RAxML
3.  Reference alignment using mothur, reference-query alignment using papara, tree building using FastTreeMP

There is really no difference between these 16S rRNA examples and execution with functional gene or aa data, except that of course you would need an alternate alignment strategy (clustalo, hmmalign, muscle) to generate the reference alignment.

Raw data files and scripts appear at the bottom of this page.  You will need to download the following:

mothur
RAxML (compile the PTHREADS versions)
FastTree (use the MP version)
papara – make sure you have recent versions of boost and the gcc compiler; it does not compile with gcc v4.2
pplacer (and taxtastic, guppy)
Archaeopteryx

The following commands are specific to the system they were run on, so be sure to change directory paths and the number of cpus where necessary.  If you are running this on a MacBook set cpus/threads to 2 for mothur and RAxML.  For pplacer to work well the query dataset should fit within the taxonomic range represented by the reference tree.  In this example a set of 16S rRNA reads were classified at the level of order, and we will try to place these reads on a reference tree constructed from full-length 16S sequences spanning that order.

Scenario 1 – Reference and reference-query alignment using mothur, tree building using RAxML

First, clean all punctuation from sequence names; note that this will also degap the sequences.  This step is necessary because pplacer tends to break on punctuation.

tr " " "_" < rdp_rhizobiales_type.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > rdp_rhizobiales_type.clean.fasta

Now use mothur to align the reference sequences against an existing high-quality (we hope) alignment.  I’ve used a concatenation of the Eukaryote, Bacteria, and Archaea Silva alignments available on the mothur website.  It’s too big to share on this site, and is overkill for this analysis, but easy to put together if you want to.  Otherwise just use the Bacteria reference file found here.

mothur "#align.seqs(candidate=rdp_rhizobiales_type.clean.fasta, \
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(trump=., vertical=T, processors=8);\
unique.seqs()"

## rename to something convenient

mv rdp_rhizobiales_type.clean.filter.unique.fasta \
rdp_rhizobiales_type.final.fasta

Now build the reference tree with RAxML.  You'll note that the tree is not bootstrapped; pplacer can't handle the statistics file that comes with a bootstrapped tree.  It is possible to work around this by generating multiple trees (with multiple statistics files), but here we just aren't going to worry about it.  In practice reviewers don't seem to worry that much about it either.  Note that we need the alignment in phy format, and that RAxML won't write over files of the same name.

perl ./fasta2phyml.pl rdp_rhizobiales_type.final.fasta
rm RAxML*
raxmlHPC-Pthreads -m GTRGAMMA \
-n rdp_rhizobiales_type.final \
-p 1234 -T 8 \
-s rdp_rhizobiales_type.final.phymlAln

pplacer wants the original fasta reference alignment, the reference tree, and the tree-building statistics file all bundled together into one re-usable reference package.  The use of reference packages also aids in the implementation of some of pplacer's fancier features, like taxonomic classification.  Reference package creation is accomplished with the taxit command from taxtastic, which comes bundled with pplacer.  Taxit won't overwrite an existing reference package, so it's useful to remove it beforehand.

rm -r rdp_type_rhizobiales.refpkg
taxit create -l 16S -P rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats RAxML_info.rdp_rhizobiales_type.final \
--tree-file RAxML_bestTree.rdp_rhizobiales_type.final

Now we prep the query reads in the same manner as the reference sequences.

tr " " "_" < galand_2009_rhizobiales.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > galand_2009_rhizobiales.clean.fasta

Combine the query and reference sequences.

cat galand_2009_rhizobiales.clean.fasta \
rdp_rhizobiales_type.clean.fasta > \
galand_rdp_rhizobiales.combined.clean.fasta

And align.

mothur "#align.seqs(candidate=galand_rdp_rhizobiales.combined.clean.fasta,\
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(vertical=T, processors=8);\
screen.seqs(minlength=50)"

At this point it is generally preferable to look at the alignment and limit the analysis to the earliest and latest start positions of your reference alignment.  You can get a rough fix on this by using the mothur command summary.seqs() on the reference-only alignment.  You can then use screen.seqs() to eliminate any wacky reads that didn't align right.  Here I don't worry about it.  In either case be sure to convert all your gap characters to '-', as pplacer doesn't like '.'.

tr "." "-" \
galand_rdp_rhizobiales.combined.clean.filter.good.fasta > \
galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta

mv galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta \
galand_rdp_rhizobiales.combined.final.fasta

With a lot of luck pplacer is happy with everything you’ve done up to this point, and won’t yell at you.  Go ahead and run it.  If you start seeing working on… messages roll across your screen you can sigh in relief.

One nasty thing that pplacer does if you aren’t looking out for it is produce multiple placements, as opposed to just the top placement.  If you want to keep things easy in downstream analysis tell it to just keep the best placement.

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p -o galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.combined.final.fasta

If pplacer ran successfully it's time to make some visualizations.  The guppy program comes with pplacer and converts the .jplace (JSON) file to phyloxml, newick, and csv formats.  It can also do some statistics and fancier things.  Here we use it to make a tree with edges fattened in proportion to the number of query sequences placed there.  After that we make a simple table of nodes and the number of placements at each.  This is great if you just want to make a histogram of your placements, which can be a lot easier than dealing with trees.

guppy fat --pp galand_rdp_rhizobiales.jplace

guppy to_csv --pp galand_rdp_rhizobiales.jplace > \
galand_rdp_rhizobiales.csv

Scenario 2 – using papara in place of the second round of mothur

Scenario 2 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with papara, followed by pplacer…

./papara -t RAxML_bestTree.rdp_rhizobiales_type.final \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp

mv papara_alignment.galand_rdp \
papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p \
-o papara_galand_rdp_rhizobiales.jplace \
papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp papara_galand_rdp_rhizobiales.jplace > \
papara_galand_rdp_rhizobiales.csv

Now you can make use of the guppy kr_heat command to highlight differences between the two trees.

guppy kr_heat --pp \
papara_galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.jplace

Visualize all the phyloxml output with Archaeopteryx.

Scenario 3 – use FastTreeMP in place of RAxML

Both pplacer and papara are pretty happy using trees produced by FastTree, which can be an elegant alternative to RAxML.  Although it is much more streamlined than RAxML, it is not quite as accurate, and should not be used on alignments with fewer than 50 sequences.

Scenario 3 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with FastTreeMP followed by papara.  Note: I don’t think FastTreeMP was the default name of that binary, check when you install the parallel version of FastTree…

FastTreeMP -nt -gtr \
-log ft_rdp_rhizobiales_type.final.log \
rdp_rhizobiales_type.final.fasta >\
ft_rdp_rhizobiales_type.final.tre

rm -r ft_rdp_type_rhizobiales.refpkg

taxit create -l 16S -P ft_rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats ft_rdp_rhizobiales_type.final.log \
--tree-file ft_rdp_rhizobiales_type.final.tre

papara -t ft_rdp_rhizobiales_type.final.tre \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp_2

mv papara_alignment.galand_rdp_2 \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c ft_rdp_type_rhizobiales.refpkg \
-p \
-o ft_papara_galand_rdp_rhizobiales.jplace \
ft_papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp ft_papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp ft_papara_galand_rdp_rhizobiales.jplace > \
ft_papara_galand_rdp_rhizobiales.csv

Below: Heat tree showing the difference between mothur and papara alignment after Scenario 2.  Reference tree is RAxML.

Scripts and data files
Posted in Research | 2 Comments

Halocarbons from young sea ice and frost flowers

An interesting paper came out recently from a group at the University of Gothenburg in Sweden.  Working at Svalbard, a common Arctic study site for European researchers, they measured the flux of organic compounds called halocarbons (or organohalides) from young sea ice and frost flowers.  Halocarbons are a class of chemicals defined by a halogen atom or atoms (fluorine, chlorine, bromine, or iodine) bound to a chain of carbon and hydrogen atoms.  The simplest halocarbons are the methylhalides, which contain only a single carbon.

Halogens are important reactive chemicals in the atmosphere, playing a role in ozone depletion (at ground level, which is good, and in the stratosphere, which is bad) and cloud condensation.  Halocarbons, especially the lower molecular weights ones like the methylhalides, are a source of halogens to the atmosphere.  When bound up in an organic molecule the halogen atoms are less reactive, and less prone to reacting with large, heavy molecules than when free.  This enables them to leave their site of origin (sea ice, seawater, leaf litter, a tree…).  Once in the atmosphere the light, unreactive halocarbons can be split apart via sunlight driven reactions, freeing the halogen as a reactive ion.

The concentration of common halocarbons in the atmosphere. Note the strong seasonal cycle for methylchloride, an indication that it is derived from photosynthetic organisms. Image from http://www.esrl.noaa.gov/gmd/hats.

Lots of people are interested in what produces the halocarbons that end up in the atmosphere.  There are many sources, including anthropogenic ones, but a major source is marine phytoplankton and algae.  Like terrestrial plants, these organisms produce oxygen during photosynthesis.  Oxygen, and the energy required to produce it, is dangerous stuff to have in a living cell.  Like gasoline in a car it can do a lot of damage if the stored energy gets released in the wrong way.  The damage oxygen does to cells is called oxidative stress, and cells stockpile antioxidants as a defense against it.  There are also certain enzymes that have antioxidant functions, including a class of enzymes called haloperoxidases that act on hydrogen peroxide (a highly oxidizing byproduct of photosynthesis).  As the name implies, halogen atoms play a role in the reaction catalyzed by haloperoxidases.  The product of the reaction is a halogen that is short an electron, making it particularly reactive.  It can react further with another molecule of hydrogen peroxide regenerating the original halogen ion.  Alternatively it can receive a carbon atom from another enzyme, making it a methylhalide.

The general reaction of haloperoxidase and hydrogen peroxide (Butler and Walker, 1993).

What does all this have to do with frost flowers and young sea ice?  Phytoplankton in the water column are preferentially entrained in young sea ice as the ice forms.  There, close to the light, they continue photosynthesis.  Halocarbons produced by their haloperoxidases are released into the brine network in sea ice rather than into seawater.  As continued ice growth pushes this brine to the surface (some of which ends up in frost flowers) the halocarbons may end up in the atmosphere much quicker than those produced by phytoplankton in the water column.  The data seem to support this: halocarbon concentrations in young sea ice peak near the surface several days after ice formation (perhaps when the entrained phytoplankton become adapted to their new home and resume photosynthesis).  After several days, however, halocarbon concentrations at the ice surface are greatly diminished, possibly due to the loss of connectivity in the brine veins as the ice temperature decreases.

Bromoform, the most abundant halocarbon measured in the study, peaks close to the ice surface at around day 12 (Granfors et al., 2013).

There is also a possible bacterial source of halocarbons in young sea ice and frost flowers, tying this work to our recent publication in EMIR (Bowman and Deming, 2013).  There we reported the presence of an unusual community of bacteria that we hypothesize are symbionts of phytoplankton trapped in the ice.  Many of these bacteria belong to taxonomic groups known to produce halocarbons using haloperoxidases similar to those contained in phytoplankton.  They do this to alleviate the oxidative stress of their phytoplankton partners, which enables the latter to undergo increased levels of photosynthesis (which ultimately benefits the bacteria).

We have not observed many phytoplankton within frost flowers or at the ice surface (conditions there are probably too hostile for them), but there is an abundance of hydrogen peroxide and other oxidizing compounds produced by sunlight at the ice surface.  Perhaps the bacteria in frost flowers, for the short time they are able to survive, are engaged in their normal task of eliminating these harmful compounds and producing halocarbons.  Unfortunately we have not yet undertaken measurements of halocarbons and other biogenic compounds in parallel with analyses of the frost flower microbial community.  Until we do this we can only speculate!

Posted in Research | Leave a comment

Compositional vectors revisited

A couple of articles back I posted a down-and-dirty method for calculating compositional vectors in a whole genome.  I'm using compositional vectors in one of my projects so I reworked the code to make the analysis more rigorous, implementing the suggestion to use Python's itertools module, shifting the alphabet to amino acid space, limiting the composition to products from open reading frames, and applying the correction from Hao and Qi, 2004.  The correction relies on the calculation of k – 1 and k – 2 length compositional vectors, which are the k1 and k2 sections in the following block.  The k section is the desired k length vector, and k0 is the normalized vector.

#### load some modules we will need ####

import os
import subprocess
import re
import gzip
import cPickle
import itertools
from Bio import SeqIO
from joblib import Parallel, delayed

#### generate all possible kmers ####

## the amino acid alphabet
pro = ['A','R','N','D','C','E','Q','G','H','I',\
'L','K','M','F','P','S','T','W','Y','V']

k = 5
k1 = k - 1
k2 = k - 2

## k      

pros = list(itertools.repeat(pro, k))
bins = list(itertools.product(*pros))
new_bins = []    
for b in bins:
    b = ''.join(b)
    new_bins.append(b)
bins = new_bins
nmers = open(str(k)+'mers_pro.set', 'wb')                   
cPickle.dump(bins, nmers)
nmers.close()

## k1

pros = list(itertools.repeat(pro, k1))
k1_bins = list(itertools.product(*pros))
k1_new_bins = []    
for b in k1_bins:
    b = ''.join(b)
    k1_new_bins.append(b)
k1_bins = k1_new_bins
k1_nmers = open(str(k1)+'mers_pro.set', 'wb')                   
cPickle.dump(k1_bins, k1_nmers)
k1_nmers.close()

## k2

pros = list(itertools.repeat(pro, k2))
k2_bins = list(itertools.product(*pros))
k2_new_bins = []    
for b in k2_bins:
    b = ''.join(b)
    k2_new_bins.append(b)
k2_bins = k2_new_bins
k2_nmers = open(str(k2)+'mers_pro.set', 'wb')                   
cPickle.dump(k2_bins, k2_nmers)
k2_nmers.close()

Once all possible words have been identified for k, k1, and k2, I count their occurrence in the proteome and then normalize the count.  The code doesn't show where "name" comes from; in my case I've already looped over the files in the directory and generated a list of file prefixes for the proteomes I want to analyze (code not shown).  I append '_pro.fasta' to each prefix to get back to the actual file name.

#### find normalized kmers in proteome ####

def calc_vector(name, bins, k1_bins, k2_bins):
    k1_found_bins = {}
    k1_used_bins = set()
    k2_found_bins = {}
    k2_used_bins = set()
    found_bins = {}
    used_bins = set()

    seqs = name+'_pro.fasta'
    for record in SeqIO.parse(seqs, 'fasta'):
        query = str(record.seq)

        ## k1 and k2

        for i in range(0,len(query)):
            kmer = query[i:i+k1]
            print name, 'k1', i
            if kmer not in k1_used_bins:
                k1_found_bins[kmer] = 1
                k1_used_bins.add(kmer)
            else:
                k1_found_bins[kmer] = k1_found_bins[kmer] + 1  

        for i in range(0,len(query)):
            kmer = query[i:i+k2]
            print name, 'k2', i
            if kmer not in k2_used_bins:
                k2_found_bins[kmer] = 1
                k2_used_bins.add(kmer)
            else:
                k2_found_bins[kmer] = k2_found_bins[kmer] + 1

        ## k

        for i in range(0,len(query)):
            kmer = query[i:i+k]
            print name, 'k', i
            if kmer not in used_bins:
                found_bins[kmer] = 1
                used_bins.add(kmer)
            else:
                found_bins[kmer] = found_bins[kmer] + 1

    ## k0

    norm_bins = {}
    for kmer in found_bins.keys():
        if len(kmer) == k:
            kmer_1 = kmer[0:-1]
            kmer_2 = kmer[1:]
            kmer_3 = kmer[1:-1]
            bigL = len(query)
            kmer_0 = ((k1_found_bins[kmer_1] * k1_found_bins[kmer_2])
            / float(k2_found_bins[kmer_3])) * (((bigL - k + 1) * (bigL - k + 3))
            / float((bigL - k + 2) ** 2))
            kmer_norm = (found_bins[kmer] - kmer_0) / kmer_0
            norm_bins[kmer] = kmer_norm
            print name, kmer, kmer_norm

    ## fill out dictionary with 0 values for unrepresented kmers

    for nmer in bins:
        if nmer not in used_bins:
            norm_bins[nmer] = 0

    with gzip.open(name+'_'+str(k)+'mer_bins.txt.gz', 'wb') as bins_out:
        for each in sorted(norm_bins.keys()):
            print >> bins_out, each+'\t'+str(norm_bins[each])

It takes a little bit of time to calculate the normalized vector for each proteome you wish to analyze.  One way to greatly speed things up is to analyze the proteomes in parallel.  I’m using the Parallel function from the joblib module for this.  Getting it to work right wasn’t straightforward, and the rather kludgy method of printing each compositional vector to file in the above function was my workaround for not being able to create a dictionary of vectors in a parallelized loop.  In the below loop I’m generating a compositional vector for each proteome specified in names, starting a new process for each proteome up to the number of cpus available on the system (n_jobs = -1).

#### run the function in parallel on input files ####
Parallel(n_jobs = -1, verbose = 5)(delayed(calc_vector)
(name, bins, k1_bins, k2_bins) for name in names)

The final piece of the puzzle is to bring together the individual vectors into a single matrix that can be converted to a distance matrix (or whatever you wish to do with it).  There are a number of ways to do this.  I chose to generate a dictionary of all vectors, with the "name" as key.  The bins are in order (or should be!) in the list that makes up each dictionary value.  It can take a long time to calculate the matrix, so I save the dictionary as a pickle so that I can access it anytime in the future without having to re-calculate anything.  Note that the matrix is transposed; if you write to a text file be sure to change the orientation.  R and Matlab (I believe) will not read a matrix with 3.2 million columns (that many rows, and many more, are okay)!

#### bring the vectors together and create some permanent ####
#### output                                               ####

final_bins = {}
for f in os.listdir('.'):
    if f.endswith('bins.txt.gz'):
        name = re.split('_'+str(k)+'mer_bins', f)
        name = name[0]
        print name
        with gzip.open(f, 'rb') as bin_file:
            temp = []
            for line in bin_file:
                line = line.rstrip('\n')
                line = line.split('\t')
                bin_id = line[0]
                bin_num = line[1]
                temp.append(bin_num)
            final_bins[name] = temp

cPickle.dump(final_bins, open(str(k)+'mer_normalized_phylogeny_vector_output.p', 'wb'))
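
Once the pickle exists, getting to a distance matrix is straightforward.  A minimal sketch, assuming numpy and scipy are available (use whatever distance measure you prefer):

import cPickle
import numpy as np
from scipy.spatial.distance import pdist, squareform

final_bins = cPickle.load(open('5mer_normalized_phylogeny_vector_output.p', 'rb'))
names = sorted(final_bins.keys())

## rows are proteomes, columns are kmers; the values were stored as strings
vec_matrix = np.array([map(float, final_bins[n]) for n in names])
dist_matrix = squareform(pdist(vec_matrix, 'euclidean'))

## dist_matrix rows and columns follow the sorted order of names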

Posted in Research | Leave a comment

Top 10 scripting tricks for basic bioinformatics

In the process of preparing some data for publication I needed to revisit some scripts I wrote several months ago, clean them up, and re-run them.  I was pretty amazed at how bad they were, and ended up spending a couple of days rewriting them (necessary?  No… but it felt good).  Many of the changes between the old versions and the new versions came down to the implementation of a relatively small number of tools.  Here are the top ten elements present in scripts I write now that have resulted in huge performance improvements over scripts I wrote, say, a year ago.  I can only hope the pace of improvement continues.  The elements are a mix of Python and shell commands or data structures.  If you're an experienced hand at writing scripts these are pretty basic tips, but if you're just starting out you might find them useful.  If you've got a life-saving trick of your own leave a comment.

Number 10: the tab delimited format

I’ll be honest.  The xml format scares me, but not as bad as the newick format and some of the other more obscure “standard” data formats that crop up in bioinformatics.  Perhaps with more experience I’ll get over this, but for now nothing makes me happier than the tab-delimited format.  Easy to read, easy to parse.  Even some formats developed by people sophisticated enough to come up with something more complex are tab-delimited (the SAM format, for example).  The three tools I use the most on a day to day basis are bash, Python, and R.  With everything tab-delimited they all play nice together.  I try to keep as much data in this format as possible.

Number 9: head and tail

Not a scripting technique per se, and one of the first shell commands most people learn.  Have a 5 G tab-delimited file of unknown columns?  Use head -X to quickly access the first X number of lines.  Need a small test file for your next script?  Redirect that same head command to a file and you’re all set.  Same applies to tail, except it returns the bottom X lines.

Number 8: the redirect

Also one of the first shell techniques most people learn.  Redirects allow you to send files to shell commands and capture output.  For example if you want a file of all the sequence names in a very large fasta, grep and a redirect will do the trick:

fgrep '>' big.fasta > big.fasta.names.txt

Related to the redirect is the pipe.  You can send the output of one command straight to another with the “|” character.  For example if you need to concatenate thousands of fasta files and then compress them:

cat *.fasta | gzip > combined.fasta.gz

This eliminates the large, uncompressed intermediate file.

Number 7: dictionaries

I didn’t really get dictionaries when I took a class on Python as a new graduate student.  I understood what they did, but not how that was particularly useful.  In the last script I wrote I think I implemented four different dictionaries, mapping gi number to ncbi id, gi to taxonomic id, ncbi id to taxonomy, and finally actual sequence information to taxonomy (or something along those lines).  This multi-step mapping would have been extremely complex and inefficient using a different data structure.  One conceptual breakthrough came when I realized dictionary values could be anything – lists, sets, even other dictionaries.  It’s a great way to organize and quickly access a lot of information.
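
A minimal sketch of that kind of chained lookup (the identifiers and taxonomy strings below are made up):

## map gi number to taxid, and taxid to taxonomy
gi_to_taxid = {'12345':'678', '67890':'432'}
taxid_to_taxonomy = {'678':'Bacteria;Proteobacteria;Rhizobiales', \
'432':'Bacteria;Bacteroidetes;Flavobacteriales'}

## values can be anything, including lists, sets, or other dictionaries
gi_to_taxonomy = {}
for gi in gi_to_taxid.keys():
    gi_to_taxonomy[gi] = taxid_to_taxonomy[gi_to_taxid[gi]]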

Number 6: parallel

Parallel computing baffles me a little.  I have about a 50 % success rate implementing parallel cluster computing on various projects using MPI, and a 0 % success rate implementing the Python parallel computing modules.  The gnu "parallel" command is a cozy warm blanket in the cold hard world of parallel computing.  If you have multiple cpus on the same motherboard (for example our lab's MacPro workstation) and a job you can easily split into many small jobs, parallel allows simultaneous execution of as many jobs as you have cores.  It works great on loops in bash shells and also on multiple commands in a bash shell script.  I've found the simplest way to execute parallel is to use Python to split my job and create a bash script with one line per job (each line might contain multiple commands).  Redirecting the script into parallel results in parallel execution.  Parallel is smart enough to capture and sort output from all the threads, even if it's all going to the same place.
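
A minimal sketch of that workflow (the chunk files and database name are placeholders): Python writes one command per line to a shell script, and parallel takes it from there.

## write one blast job per line; output goes to stdout so parallel can collect it
with open('my_awesome_script.sh', 'w') as script:
    for i in range(0, 8):
        chunk = 'chunk_'+str(i)+'.fasta'
        print >> script, 'blastn -query '+chunk+' -db my_db -outfmt 6'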

parallel < my_awesome_script.sh > my_dissertation.txt

Number 5: grep

Another basic shell command.  Grep, fgrep, and zgrep are magic commands that can extract critical lines of data based on pattern matching.  If you aren’t using a nested file format (if you are using tab-delimited or similar) the grep commands can quickly subset a massive file to just those lines you need.  It’s worth looking at all the options available for these commands, which make them quite powerful.

Number 4: “vectorized” commands

I'm not sure if "vectorized commands" is the right term, but I've heard it applied to R, which discourages the use of loops and relies very heavily on commands that iterate intrinsically.  Python tends to rely more heavily on loops, but it has a number of these intrinsic iterators that are much, much faster than loops.  For example, to match a string to a list of strings you might be tempted to use a loop:

for s in some_list:
     if my_string == s:
          do something

However the “in” command will perform much faster:

if my_string in some_list:
     do something

Number 3: gzip

Bioinformatics is big files.  You can’t escape from them.  The gigabytes keep growing until they’re terabytes, and then I don’t really want to think about what happens.  And of course you have to move all these files from computer to computer and try to do things with them.  Python (using the gzip module) and R (natively) both play well with gzip.  So does zgrep.  I try to keep everything larger than a couple of megabytes in the gzip format.  This saves a ton of space and of course file transfers are faster.
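
For example (the file name is made up), a gzipped tab-delimited file can be read line by line straight from the compressed file:

import gzip

## read a compressed table without ever uncompressing it on disk
with gzip.open('big_table.txt.gz', 'rb') as table:
    for line in table:
        columns = line.rstrip('\n').split('\t')
        ## do something with columns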

Number 2: with

With is another Python statement that I didn't really get at first.  Then came the day that I needed to work through a many-gigabyte file in Python and couldn't do it.  The "with open(…) as handle:" construction opens the file for reading without holding the whole thing in memory (and closes it cleanly when the block ends).  Add an iterator to loop across the lines.  Downside, it adds one more indentation level to that block of your code.  Upside, you can add multiple "open as" segments to the with line:

with open('file_1.txt', 'r') as file1, \
open('file_2.txt.gz', 'rb') as file2, \
open('output.txt', 'w') as output:
     for line in file1:
          do lots of stuff

Number 1: the set

The single most useful scripting trick I've learned over the last year is the set.  Sets are containers of unique, unordered elements.  Searching across a set is lightning fast compared to searching across a list.  Sets convert to and from ordered lists easily enough if ordering is important and duplicate entries are not.  I almost never search across Python lists anymore.  Sets have allowed me to create some scripts that will run overnight that would have taken weeks or months with lists.
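
A quick illustration of the difference (the sizes and names are arbitrary):

big_list = ['gi_'+str(i) for i in range(0,1000000)]
big_set = set(big_list)                 ## one-time conversion

## membership tests against the set are effectively constant time, while the
## same test against the list has to scan until it finds a match
if 'gi_999999' in big_set:
    print 'found it'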

Happy scripting!

Posted in Research | 2 Comments

The future of phylogeny

Yesterday I went to an inspiring informal seminar by Bailin Hao from Fudan University in Beijing.  Bailin presented work by his group to establish a new method for determining microbial phylogeny.  Inferring phylogeny is currently a massive headache in the field of microbiology.  Traditionally phylogeny was determined by aligning a marker gene (typically the 16S rRNA gene) from a set of bacteria, and inferring relatedness within the set from differences in these alignments.  These differences are the result of insertions, deletions, and base changes accumulated over time as the gene evolved independently for each analyzed strain.

Although this approach has served the study of microbial evolution and ecology well, it has some serious shortcomings.  First, phylogenetic inference from gene alignment is hard.  A lot of work has gone into developing sophisticated statistical models that predict how likely a given mutation is to occur (and thus when gaps should appear in an alignment), but even the best model can't really capture what's occurring in nature.  No alignment is perfect, and even imperfect alignments take a lot of time and computer power to calculate.  Second, genomes don't always evolve in a predictable vertical fashion, with a daughter genome acquiring all the genes of the parent genome and perhaps a beneficial mutation or two.  Single-celled organisms often pick up genes they encounter in the environment.  If beneficial, those genes can find a permanent home in the genome.  There is no guarantee that the source organism of a "horizontally transferred" gene is closely related to the recipient, thus a phylogenetic inference based on a horizontally transferred gene will be incorrect with respect to the whole genome.  Third, there is the problem of what taxonomy means anyway.  If a bacterial genome is (hypothetically) 30 % horizontally acquired genes, and has a radically different phenotype than its closest relatives as a result of these transfers, what is the value of determining traditional taxonomy?

Enter compositional vector based phylogeny.  This method is just one of a new generation of sequence comparison approaches that use kmers to determine similarity.  A kmer is a word (of a length set by the investigator) created from a sliding window moving along the sequence in question.  For example the sequence AGCTCCGGC would have the 4 letter kmers AGCT, GCTC, CTCC, TCCG, CCGG, and CGGC.  The more closely related two sequences are, the more kmers they will have in common.  The kmers in common can be counted without aligning the two sequences, thus eliminating one of the most time and computer intensive parts of sequence comparison.
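
The sliding window is simple enough to show in a couple of lines of Python, using the example sequence from above and a second, made-up sequence for comparison:

seq_1 = 'AGCTCCGGC'
seq_2 = 'AGCTCCGGA'
k = 4

## one kmer per window position
kmers_1 = set([seq_1[i:i+k] for i in range(0, len(seq_1) - k + 1)])
kmers_2 = set([seq_2[i:i+k] for i in range(0, len(seq_2) - k + 1)])

## kmers shared by the two sequences, no alignment required
shared = kmers_1 & kmers_2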

In the compositional vector approach to phylogeny a complete microbial proteome is converted to a vector of all possible amino acid kmers of a specified size.  For a five letter kmer the possible kmers are AAAAA through VVVVV, with 3,200,000 possible words (20 ^ 5).  Tallying the number of occurrences of each possible kmer in the proteome produces a vector.  Converting two or more vectors to a distance matrix provides a measure of phylogenetic distance.

What really struck me was the relative simplicity of this method.  In reality there are some important additional steps – namely, the vectors need to be normalized to account for neutral evolution – but any investigator can use this method without sophisticated tools.  Python's probably not the best language for this, but the concept can be explored with a simple Python script.  The script below will calculate a matrix of compositional vectors for some number of proteomes, WITHOUT the important normalization step.  It's not fast, but it can calculate the compositional vectors for four proteomes (in the example obtained from UniProt) in a few minutes' time.

import cPickle
import os

## if you haven't already done so generate all possible 
## five letter words

if '5mers.set' not in os.listdir('.'):
    aa = ['A','R','N','D','C','E','Q','G','H','I',\
    'L','K','M','F','P','S','T','W','Y','V']   
    bins = set()    
    temp = [0,0,0,0,0]

    for residue_1 in aa:
        temp[0] = residue_1
        for residue_2 in aa:
            temp[1] = residue_2
            for residue_3 in aa:
                temp[2] = residue_3
                for residue_4 in aa:
                    temp[3] = residue_4
                    for residue_5 in aa:
                        temp[4] = residue_5
                        temp_string = ''.join(temp)
                        print temp_string
                        bins.add(temp_string)

    fivemers = open('5mers.set', 'wb')                   
    cPickle.dump(bins, fivemers)
    fivemers.close()

else:
    fivemers = open('5mers.set', 'rb')
    bins = cPickle.load(fivemers)
    fivemers.close()

## set up a matrix to hold kmer tally
vec_matrix = []
for b in sorted(bins):
    l = [b]
    vec_matrix.append(l)

## concatenate proteins in multifasta to single string

for seq in ['34H_proteome.fasta', \
'agrobacterium_tumefaciens.fasta', \
'sinorhizobium_medicae.fasta', \
'sinorhizobium_meliloti.fasta']:    
    with open(seq, 'r') as fasta:
        query = ''
        for line in fasta:
            if line.startswith('>') == False:
                line = line.rstrip('\n')
                query = query+line

    ## search proteome for all 5 letter kmers, and tally occurrence

    found_bins = {}
    used_bins = set()   

    for i in range(0,len(query)):
        kmer = query[i:i+5]
        print i
        if kmer not in used_bins:
            found_bins[kmer] = 1
            used_bins.add(kmer)
        else:
            found_bins[kmer] = found_bins[kmer] + 1

    ## fill out dictionary with 0 values for unrepresented kmers

    for fivemer in bins:
        if fivemer not in used_bins:
            found_bins[fivemer] = 0

    for i,r in enumerate(sorted(bins)):
        vec_matrix[i].append(found_bins[r])

## print matrix as tab delimited text file

output = open('phylogeny_vector_output.txt', 'w')        
for row in vec_matrix:
    for i,e in enumerate(row):
        if i+1 != len(row):
            print >> output, str(e)+'\t',
        else:
            print >> output, str(e)   
output.close()

Here’s what the output looks like:

AAAAA	12	90	125	133
AAAAC	0	1	6	3
AAAAD	2	12	21	28
AAAAE	5	26	53	59
AAAAF	3	21	25	30
AAAAG	8	42	62	69
AAAAH	0	6	9	11
AAAAI	8	30	44	46
AAAAK	2	27	20	26
AAAAL	16	50	73	75

The whole matrix is ~ 50 MB; for anything larger than a few proteomes it would be advisable to compress the output (cool trick – I only recently learned that R will read in a .gz file with no additional prompting, which has saved me a ton of storage space of late).  To look at the relatedness of the strains you could use R or any other data analysis platform.  For this case a dendrogram derived from a distance matrix of the original data should illustrate compositional relatedness.  Again, I stress that this is an overly simplistic interpretation of the actual method.  In R…

d <- t(as.matrix(read.table('phylogeny_vector_output.txt',
                            row.names = 1)))
d_dist <- dist(d, "euclidean")
d_clust <- hclust(d_dist)
plot(as.dendrogram(d_clust))

Voilà, a dendrogram of relatedness!  V2-5 correspond to the order of the original composition matrix, or Colwellia psychrerythraea, Agrobacterium tumefaciens, Sinorhizobium medicae, and Sinorhizobium meliloti.  As we would expect Colwellia, a Gammaproteobacteria, clusters separately from the other strains, all Alphaproteobacteria.  The two Sinorhizobium strains cluster closest.  For more about this approach, and details on the full method, check out Li, Xu, and Hao, 2010 in the Journal of Biotechnology.  The Hao lab has an excellent website in beta where all available genomes are represented on an interactive tree: http://tlife.fudan.edu.cn/cvtree/.

Dendrogram of Euclidean distance based on complete proteome compositional vectors for Colwellia psychrerythraea (V2), Agrobacterium tumefaciens (V3), Sinorhizobium medicae (V4), and Sinorhizobium meliloti (V5).

Posted in Research | 1 Comment

Some things should be easy, and they are…

**UPDATE**
Got it.  You have to set the BLASTDB variable for this to work and place the taxonomy files in this same location.  I just did it in my .bash_profile:

BLASTDB="/path/to/databases"
export BLASTDB

**UPDATE**
The original motivation for this was to obtain taxonomy data for blast search results.  Sometime in the last few months this feature actually slipped into the BLAST+ package.  I can’t actually get the feature to work, but perhaps you can!  It requires the installation of two special taxonomy files found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz.  Typically cryptic instructions can be found on the command line blast manual page at http://www.ncbi.nlm.nih.gov/books/NBK1763/, along with the necessary flags to get the taxonomy to appear (ha!) in table format output.  I suspect that my inability to get this to work stems from my dislike for mucking with the blast profile file.  Instead of setting a default database location I use the complete path in my calls.  Regardless of where I locate the taxonomy database blast can’t seem to find it 🙁

****

A couple of months ago I wrote a post on the difficulty of parsing the gi_taxid_nucl.dmp.gz file that maps sequence gi numbers to their NCBI taxonomy number (taxid).  With a taxid it is possible to obtain the actual taxonomy of (some) sequences in Genbank.  I proposed a couple of elaborate solutions, some of which were improved upon in the comments section by other contributors.  One particularly good suggestion was to use primary keys in a SQL database for fast look-ups.

In my latest analysis I have ~ 50,000 gi numbers to map to taxids and had to revisit this issue again.  After some tinkering I realized I’d missed a very obvious and reasonably efficient way to extract the relevant lines from the gi_taxid_nucl/prot.dmp.gz file.  If you have only a few gi numbers to map stick with the SQL or grep solutions.  If you have many give this method a try.  Making use of Python’s with statement it is possible to very quickly search each line of the .dmp.gz against a known collection of gi numbers contained in a set.  I only recently discovered Python sets, and they have already saved me numerous times.  Searching against items in a set is orders of magnitude faster than searching against items in a list.

# assuming that gis is a set of gi numbers...
import gzip
gi_taxid_output = open('small_gi_taxid.txt', 'w')
with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp:
    found = 0
    for line in dmp:
        line_split = line.split('\t')
        print line_split[0], found
        if str(line_split[0]) in gis:
            found = found + 1
            gi_taxid_output.write(line) # line already ends with '\n'

gi_taxid_output.close()

Using this method a search for 50,000 gi numbers takes ~ 1 hour for gi_taxid_nucl.dmp.gz and about 30 minutes for gi_taxid_prot.dmp.gz.  The grep method, even utilizing the gnu parallel command for an 8 fold speed increase (on an 8 core machine), would have taken at least overnight and produced a messy gi_taxid file with many erroneous hits.

Posted in Research | 6 Comments

Entry-level computers for bioinformatics

When I started graduate school I knew absolutely nothing about computing on anything more powerful than a laptop.  I assumed that clusters and servers were exclusive to large, well-funded labs.  Access to these items is a huge limiting factor in almost any field of research.  This is certainly the case in microbiology, where it is increasingly difficult to close the loop on an interesting story without volumes of sequence data.  At the time (and for a long time after) I was indignant at this seeming injustice – only large labs with large budgets had access to the large computers required for the most compelling investigations (the results of which enable additional funding…).  Over time I've slowly come to appreciate that this is not the case.  A high performance computer is cheaper than many items deemed essential to a microbiology lab (-80 C freezer (~ $10 K), microscope (~ $20 K)).  I suspect there are other students out there with similar misconceptions, so here is a short compilation of things I've learned regarding research computers.  Imagine that you are a new faculty member, or a postdoc with a little fellowship money, looking to purchase a computer for basic bioinformatics (e.g. genome assembly and annotation).  What will you have to spend?

A necessary consideration is how you choose to use a computer in a research setting.  Our lab's "server" is a MacPro workstation (more on that shortly).  After some trial and error I settled into an arrangement where I work almost exclusively from my laptop.  I use ssh and scp to access the server, and almost never sit in front of it.  The rest of this post assumes that the purchaser works in a similar fashion.  It's a matter of personal style; other lab members like to park in front of the monitor (I'd rather park in the coffee shop, or at least my office).  Unlike almost everyone else in the world of bioinformatics my laptop is a PC.  Again, a matter of style.  It works great for me.  The point is that it works almost seamlessly with the MacPro, and all other Unix/Linux servers.  It really doesn't matter what you connect with; the server is the important thing.  It's worth bringing this up because, as a graduate student, I'd rather have a cheap laptop and a decent remote workstation than a nice laptop (of course in an ideal world you would have both).  If you can't ask your advisor for both, ask for the workstation…

Our MacPro was purchased back in 2008, for all the right reasons.  Next generation sequencing, which generates orders of magnitude more data than older methods, had just been introduced.  A Mac with 8 G of RAM and 8 cores seemed like more than enough computer.  Nobody in either of the labs using the computer knew much about Unix or Linux, so it didn't seem like there was much of a choice.  Unfortunately in this decision a cardinal rule of computers was violated – a computer deemed sufficient to do the job you are purchasing it for is likely to be insufficient for the next job.  It's hard to justify spending money on additional resources without a clear use for them, but it just has to be done in some cases.  When I started working with next generation data I upped the RAM to 24 G and added an additional 2 T HD.  These upgrades make the MacPro suitable for 75 % of the tasks I need to perform on a day to day basis, but the other 25 % is kind of essential…

The top of the line MacPro today, with 8T of storage, 12 cores, and 64 G of RAM will run you around $6,600 (tower only).  For a long time I assumed this was the only way to go – more capable workstations must be prohibitively more expensive, right?  Since there’s a lot that you can’t do with only 64 G RAM and 12 cores I assumed that that work was the sole purview of well funded labs.  Sometime back I explained this dilemma to a colleague in our geology department who deals with large geospatial datasets, and who clued me in on the price of computers.

Macs are friendly, but some friends can be bossy and expensive to hang out with.  In this case it takes a bit of tinkering to make a Mac work well as a Unix server.  If that's how you're using it then you aren't using the features that you've paid a lot of money for – namely the Mac OS GUI front end.  True Unix/Linux workstations are comparatively cheap.  For ~ 5 K I found two local companies who could provide a server-grade tower with 128 G RAM (minimal for many applications, but expandable to 512 G), and two 8-core Intel Xeon processors (16 cores total).  There's plenty that such a machine can't do, but it's not a bad start.

I know that other groups have migrated to Amazon EC2 for research computing, but there are clear downsides.  Amazon "instances" are expensive, and limited to 64 G RAM.  If you aren't a computer expert you could spend a lot of money just trying to figure out how to make things work.  It also seems that data exploration would be limited by cost, and by the slight difficulty inherent in setting up instances.  Most research institutions have some kind of high performance computing environment; I've used the University of Washington's Hyak cluster for a lot of my work.  This environment is great when many processors are desired, but, as with Amazon EC2, it is expensive and memory limited.  I'm pretty new at this and just getting a sense of what's out there.  If you've had a good or bad experience with any of this technology leave a comment!

Posted in Research | 1 Comment