Phylogenetic Placement Revisited

UW Oceanography has a student-led bioinformatics seminar that meets every spring and fall (or at least it has since the fall of 2012; we hope to continue it).  Each quarter is a little different; this time around each seminar attendee will present on a tool or set of tools that they’ve applied in their research.  It’s a good way for us to keep up on the ever-changing slate of available tools for sequence analysis, and to delve a little deeper into the ones we currently use.  This week it’s my turn, and I’ve prepared a short tutorial on pplacer, a common phylogenetic placement program.  I wrote a post about phylogenetic placement a few months ago.

There is some unpublished data posted on the seminar website, so the site unfortunately isn’t public.  To make the tutorial available I’ve copied it here.  What was great about putting this tutorial together was the opportunity to try papara, an alignment program developed by the same group responsible for RAxML.  Check out Scenario 2 in the tutorial for details on papara.

**************************************************************************

This is a tutorial for using pplacer with 16S rRNA genes under three separate scenarios:

1.  Reference and reference-query alignment using mothur, tree building using RAxML
2.  Reference alignment using mothur, reference-query alignment using papara, tree building using RAxML
3.  Reference alignment using mothur, reference-query alignment using papara, tree building using FastTreeMP

There is really no difference between these 16S rRNA examples and execution with functional gene or amino acid data, except that you would of course need an alternate alignment strategy (clustalo, hmmalign, muscle) to generate the reference alignment.

Raw data files and scripts appear at the bottom of this page.  You will need to download the following:

mothur
RAxML (compile the PTHREADS version)
FastTree (use the MP version)
papara – make sure you have recent versions of boost and the gcc compiler; it does not compile with gcc v4.2
pplacer (and taxtastic, guppy)
Archaeopteryx

The following commands are specific to the system they were run on, so be sure to change directory paths and the number of cpus where necessary.  If you are running this on a MacBook set cpus/threads to 2 for mothur and RAxML.  For pplacer to work well the query dataset should fit within the taxonomic range represented by the reference tree.  In this example a set of 16S rRNA reads was classified at the level of order, and we will try to place these reads on a reference tree constructed from full-length 16S sequences spanning that order.

Scenario 1 – Reference and reference-query alignment using mothur, tree building using RAxML

First, clean all punctuation from sequence names; note that this will also degap the sequences.  This step is necessary because pplacer tends to break on punctuation.

tr " " "_" < rdp_rhizobiales_type.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > rdp_rhizobiales_type.clean.fasta
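If you prefer Python to a long chain of tr commands, the same cleanup can be sketched in a few lines (a rough equivalent of the pipeline above; the sequence name in the example is made up):

```python
## mirror the tr pipeline: separator characters become underscores,
## the remaining punctuation is deleted outright
def clean_line(line):
    for char in ' :=.-':
        line = line.replace(char, '_')
    for char in '%\',;()"*[]':
        line = line.replace(char, '')
    return line

print(clean_line('>Rhizobium sp. (strain 1); type'))
## prints >Rhizobium_sp__strain_1_type
```

Run it over the whole fasta file line by line and you get the same result as the tr chain, headers and sequences alike.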

Now use mothur to align the reference sequences against an existing high-quality (we hope) alignment.  I’ve used a concatenation of the Eukaryote, Bacteria, and Archaea Silva alignments available on the mothur website.  It’s too big to share on this site, and is overkill for this analysis, but easy to put together if you want to.  Otherwise just use the Bacteria reference file found here.

mothur "#align.seqs(candidate=rdp_rhizobiales_type.clean.fasta, \
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(trump=., vertical=T, processors=8);\
unique.seqs()"

## rename to something convenient

mv rdp_rhizobiales_type.clean.filter.unique.fasta \
rdp_rhizobiales_type.final.fasta

Now build the reference tree with RAxML.  You’ll note that the tree is not bootstrapped; pplacer can’t handle the statistics file that comes with a bootstrapped tree.  It is possible to work around this by generating multiple trees (with multiple statistics files), but here we just aren’t going to worry about it.  In practice reviewers don’t seem to worry that much about it either.  Note that we need the alignment in phy format, and that RAxML won’t write over files of the same name.

perl ./fasta2phyml.pl rdp_rhizobiales_type.final.fasta
rm RAxML*
raxmlHPC-Pthreads -m GTRGAMMA \
-n rdp_rhizobiales_type.final \
-p 1234 -T 8 \
-s rdp_rhizobiales_type.final.phymlAln

pplacer wants the original fasta reference alignment, the reference tree, and the tree-building statistics file all bundled together in one re-usable reference package.  The use of reference packages also aids in the implementation of some of pplacer’s fancier features, like taxonomic classification.  Reference package creation is accomplished with the taxit command from taxtastic, which comes bundled with pplacer.  Taxit won’t overwrite an existing reference package, so it’s useful to remove any old one beforehand.

rm -r rdp_type_rhizobiales.refpkg
taxit create -l 16S -P rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats RAxML_info.rdp_rhizobiales_type.final \
--tree-file RAxML_bestTree.rdp_rhizobiales_type.final

Now we prep the query reads in the same manner as the reference sequences.

tr " " "_" < galand_2009_rhizobiales.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > galand_2009_rhizobiales.clean.fasta

Combine the query and reference sequences.

cat galand_2009_rhizobiales.clean.fasta \
rdp_rhizobiales_type.clean.fasta > \
galand_rdp_rhizobiales.combined.clean.fasta

And align.

mothur "#align.seqs(candidate=galand_rdp_rhizobiales.combined.clean.fasta,\
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(vertical=T, processors=8);\
screen.seqs(minlength=50)"

At this point it is generally preferable to look at the alignment and limit the analysis to the earliest and latest start positions of your reference alignment.  You can get a rough fix on this by using the mothur command summary.seqs() on the reference-only alignment.  You can then use screen.seqs() to eliminate any wacky reads that didn’t align right.  Here I don’t worry about it.  In either case be sure to convert all your gap characters to ‘-’, as pplacer doesn’t like ‘.’.

tr "." "-" < \
galand_rdp_rhizobiales.combined.clean.filter.good.fasta > \
galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta

mv galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta \
galand_rdp_rhizobiales.combined.final.fasta

With a lot of luck pplacer is happy with everything you’ve done up to this point, and won’t yell at you.  Go ahead and run it.  If you start seeing “working on…” messages roll across your screen you can sigh in relief.

One nasty thing that pplacer does if you aren’t looking out for it is produce multiple placements per query, as opposed to just the top placement.  If you want to keep things easy in downstream analysis, tell it to keep only the best placement.

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p -o galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.combined.final.fasta

If pplacer ran successfully it’s time to make some visualizations.  The guppy program comes with pplacer and converts the .jplace file (which is JSON) to phyloxml, newick, and csv formats.  It can also do some statistics and fancier things.  Here we use it to make a tree with edges fattened in proportion to the number of query sequences placed there.  After that we make a simple table of nodes and the number of placements at each.  This is great if you just want to make a histogram of your placements, which can be a lot easier than dealing with trees.

guppy fat --pp galand_rdp_rhizobiales.jplace

guppy to_csv --pp galand_rdp_rhizobiales.jplace > \
galand_rdp_rhizobiales.csv
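The csv output makes that kind of histogram easy.  Here’s a quick sketch of tallying placements per edge with Python’s csv module; I’m assuming the placement column is named edge_num, so check the header line of your guppy version’s output:

```python
import csv
from collections import Counter

## tally the number of query reads placed on each edge of the reference tree
def placements_per_edge(csv_path, edge_col='edge_num'):
    with open(csv_path) as csv_in:
        return Counter(row[edge_col] for row in csv.DictReader(csv_in))
```

Then placements_per_edge('galand_rdp_rhizobiales.csv').most_common(10) gives the ten busiest edges.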

Scenario 2 – using papara in place of the second round of mothur

Scenario 2 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with papara, followed by pplacer…

./papara -t RAxML_bestTree.rdp_rhizobiales_type.final \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp

mv papara_alignment.galand_rdp \
papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p \
-o papara_galand_rdp_rhizobiales.jplace \
papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp papara_galand_rdp_rhizobiales.jplace > \
papara_galand_rdp_rhizobiales.csv

Now you can make use of the guppy kr_heat command to highlight differences between the two trees.

guppy kr_heat --pp \
papara_galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.jplace

Visualize all the phyloxml output with Archaeopteryx.

Scenario 3 – use FastTreeMP in place of RAxML

Both pplacer and papara are quite happy using trees produced by FastTree, which can be an elegant alternative to RAxML.  Although it is much more streamlined than RAxML, it is not quite as accurate, and should not be used on alignments with fewer than 50 sequences.

Scenario 3 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with FastTreeMP followed by papara.  Note: I don’t think FastTreeMP was the default name of that binary; check when you install the parallel version of FastTree…

FastTreeMP -nt -gtr \
-log ft_rdp_rhizobiales_type.final.log \
rdp_rhizobiales_type.final.fasta >\
ft_rdp_rhizobiales_type.final.tre

rm -r ft_rdp_type_rhizobiales.refpkg

taxit create -l 16S -P ft_rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats ft_rdp_rhizobiales_type.final.log \
--tree-file ft_rdp_rhizobiales_type.final.tre

papara -t ft_rdp_rhizobiales_type.final.tre \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp_2

mv papara_alignment.galand_rdp_2 \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c ft_rdp_type_rhizobiales.refpkg \
-p \
-o ft_papara_galand_rdp_rhizobiales.jplace \
ft_papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp ft_papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp ft_papara_galand_rdp_rhizobiales.jplace > \
ft_papara_galand_rdp_rhizobiales.csv

Below: Heat tree showing the difference between the mothur and papara alignments after Scenario 2.  The reference tree is from RAxML.

Scripts and data files
Posted in Research | 2 Comments

Budget Woes

Everyone’s impacted by the shutdown in the federal government in some way.  First and foremost are federal employees, including research scientists at NOAA, NASA, and the other research-minded federal agencies.  Not only are these individuals locked out of their workplaces, with their laptops confiscated, they are not allowed to work from home, in any capacity.  Consider the following case:

The shutdown coincided with the release of the 5th IPCC report; numerous federal research scientists are listed as authors of that document.  Several researchers from the NOAA Pacific Marine Environmental Laboratory (you can click that link, but of course it won’t take you anywhere) involved in the report, who are also adjunct faculty at the UW, had organized a town hall style event on the day of the shutdown to talk about the report.  They were threatened with a $5,000 fine if they violated the furlough by attending.  Instead a single UW professor, knowledgeable of but not expert in the diverse topics covered in the report, presented on the report for a packed house.  An important outreach opportunity was much diminished – I suspect that Ted Cruz would view this as a minor victory.  Outreach is considered an essential part of science, particularly climate science, and is something that most of us volunteer some amount of our time for.  The strict rules regarding how this plays out for federal employees during a furlough highlight the absurdity of the current budget fight.

At a similar level of absurdity is the current situation with the US Antarctic Program, which has just been placed in “caretaker” status until the situation is resolved.  Apparently re-deploying personnel and equipment at great expense is the fiscally responsible thing to do…

Closer to home I think that I, and other graduate students on certain federal fellowships, won’t receive stipends until this is over (students whose fellowships are administered by their institution should be fine).  This isn’t entirely clear, but all the people who know whether or not we will get paid are on furlough.  Given that I make about as much as a graduate student as I did as a junior NCO 10 years ago, that’s going to be interesting.  At least no one is going to kick me out of my building and take my laptop…


A new model for scientific careers

Traditionally there were three ways that, having acquired a PhD, a researcher could continue with their scientific career:

1)  Seek a faculty job at an academic institution (typically a university).  University jobs are wonderful in that they allow for complete creativity.  No one but the faculty member sets his or her research agenda.  The downside is relatively low pay and very little chance at reasonable work-life balance.  The general divestment in science and education, particularly at the state level, means that faculty members have to raise an increasingly large portion of their lab operating expenses (including salary) from an increasingly small pool of federal funds.  Most new faculty members find themselves in their mid-30s, holding advanced degrees, and working very long hours for something south of $100,000 a year (except at the most prestigious and well-endowed institutions).  Such is the price of freedom.

2)  Find a government position.  The government does, with some reluctance, support a fair amount of mission-driven basic science at various agencies.  Of particular note are NASA (and its quasi-government affiliates), the DOE, and NOAA.  Permanent government research positions generally offer better pay than academia and better work-life balance.  Although the science pursued by government agencies is mission directed, there is a certain amount of flexibility.  Government scientists, after all, determine how a research or monitoring mission is implemented at the agency level.  Despite these benefits there are some real drawbacks to conducting research at a federal agency.  The bureaucracy is inflexible (and efforts to rein in spending, generally not an issue at research-minded agencies, only create more of it) and subject to the drift of political whims.  Many federal researchers are impaired by travel restrictions, work stoppages (which disallow even unpaid work from a researcher’s home), and budget woes.  Your salary might be guaranteed, but what’s the point if your employer prevents you from doing interesting work and doing it well?

3)  Get a job in industry.  In academic circles this is viewed as going over to the dark side.  The pay is good, the work life balance is great, but the academic freedom is nonexistent.  Gone are the days of the Bell Labs, when industry supported basic research as a means of cultivating innovation (that job now falls to the federal government, through NSF and NIH).

So what’s a young researcher to do?  There is a fourth option available that until recently was pursued by only a few researchers with either independent wealth or the discipline required to be their own administrator: the cottage research industry.  Particularly ideal for researchers with low infrastructure requirements (this won’t work if you need a Level 4 biosafety lab), in the cottage industry model you incorporate your own non-profit institution and set up your research environment in the basement (or local coffee shop).  So long as you can leverage a little federal funding (i.e. continue to write competitive research proposals) you can operate with a fantastic degree of flexibility.  There are downsides of course, foremost among them your long-term financial security and your ability to stay connected with your peers and on the cutting edge.

What is changing now is the emergence of virtual institutions that can take on the administrative role for cottage-industry researchers, and help them maintain the kind of collaborations that really make cutting edge research possible.  One of these institutions is the Blue Marble Space Institute, brainchild of Sanjoy Som, a graduate of the Astrobiology Program (via the Atmospheric Sciences PhD track) here at the University of Washington.  There is a vetting process to join Blue Marble but membership is non-competitive; anyone with legitimate research credentials can join.  Members of the institute can submit proposals to private and public funding sources, benefit from internal proposal peer review, and enjoy the kind of collaboration possible at a traditional academic institution.  Overhead (the amount that an institution adds to the proposal budget to cover “administrative costs”) is a fraction of what it is at a university, helping Blue Marble proposals look more competitive.

Blue Marble and similar virtual institutions have the potential to dramatically alter the academic career landscape.  I anticipate submitting an application to Blue Marble in the near future; membership will enable me to submit proposals in parallel with potential postdoctoral advisors (graduate students and even postdocs are typically not allowed to submit proposals to federal programs).  If I’m unable to obtain a postdoctoral appointment it also lets me keep my foot in the door, refining and submitting proposals until there’s a break.  I think the real advantage to Blue Marble, however, is its ability to empower researchers who might otherwise opt out of the traditional track.  I’m thinking specifically of researchers (male or female) who want to raise a family and have an academic career.  Graduate students often romantically partner with other graduate students, a pairing which almost guarantees future family/career conflicts between two career-minded individuals.  Blue Marble is a new tool in the toolkit for building an academic family, providing a mechanism for taking some time away from a brick and mortar institution while staying totally in the game.

And while I’m on the subject, Sanjoy has a second goal… the adoption of a symbol of global unity for the International Space Station.  The ISS is after all the most visible example of the possibilities of international space collaboration for the public good.  You can learn more about that project, and vote to have the patch adopted through CASIS, here.



Risk in Research

The polar research community was reminded of the risks of its work yesterday with the news that a helicopter crash had killed two Canadian Coast Guard officers, including the commander of the venerated research icebreaker CCGS Amundsen, and scientist Klaus Hochheim.  I’m at the biennial Polar and Alpine Microbiology Conference right now, and there were many sad reactions to the news.  I haven’t had the privilege of working with any of the crash victims, but many here have.  The news resulted in a long discussion over dinner last night on risk in field research.  The question that never seemed to get answered was: how much risk is acceptable?

Very often in response to tragedy the answer is determined, with very little discussion, to be zero.  Two cases were presented at dinner where, following non-lethal accidents, unilateral decisions were made to reduce risk to zero.  These two cases are very different from the tragedy on Monday – the accidents occurred during non-essential activities and on ships belonging to UNOLS, the University-National Oceanographic Laboratory System (the US fleet of academic research vessels).  The policy changes from these incidents, however, are indicative of a risk-averse mindset among science managers.  The lack of conversation on what is an acceptable level of risk is troubling.  Advocating for zero risk is easy, but it is impractical, costly, and threatens the quality of research.  A quick look at the incidents in question:

DISCLAIMER – I made an effort to find some documentation on these incidents but LexisNexis didn’t turn up anything useful.  The following is lore that has been passed down through the oceanographic community.

Incident 1: Shark bites girl, girl bites UNOLS

UNOLS vessels used to hold “swim calls” while in tropical waters.  Researchers on long cruises could take a quick dip after weeks in a cramped, hot vessel.  A pretty good perk.  Perks are good for morale.  Morale is good for work.  Some years ago (?) a swim call was held after a vessel had been stationary for some time, and after scraps of food were dumped overboard.  There was a shark.  Someone got bit.  They sued.  UNOLS banned swim calls.

Incident 2:  Get drunk once, never drink again

This incident happened in the 1970s and the details are even sketchier.  UNOLS vessels used to allow alcohol, but after an inebriated crewmember was injured (and sued…) all alcohol was banned.  The US is unique among the major nations that conduct oceanographic research in having dry ships (expeditions on foreign vessels are very popular).

You might say: so what?  In neither case did a ban really reduce the ability of the UNOLS fleet to conduct research.  I’ve even heard from some female colleagues that they favor the alcohol ban – weeks at sea with a majority male crew can be awkward enough without alcohol.  The problem is the imposition of rules disproportionate to the incidents, the mindset that seeks zero risk.  Perhaps this results from the reasonable desire of UNOLS to not be sued by an individual over the consequences of a choice he or she makes.  What is baffling is the rejection of the obvious alternative: inform individuals of the risks of field research, and then hold them accountable for the decisions they make in the field.

At various times I’ve worked as a divemaster, whitewater kayak instructor, and raft guide.  There is a small amount of risk involved in participating in any one of these leisure activities, and every year people die or are seriously injured doing them.  Despite this a sort of equilibrium has been reached – people are informed of the risk, acknowledge it in consent forms,  and participate.  There is no need to eliminate all the risk – we don’t restrict rafting tours to flat sections of river or scuba divers to water they can stand up in.  I’ve never once signed a liability waiver as a researcher, despite working in some very challenging environments.  Instead it was assumed that bureaucracy had rounded all the sharp corners, and that I was free to sue if I stubbed my toe on one they missed.

There is no way to know what policy changes will result from Monday’s incident.  It would be crass to speculate at a time when families and colleagues are coming to terms with the tragedy, and irresponsible when so few details are known.  We are all reminded that risk comes with the privilege of pursuing science outside the office.  I hope that as a community (and for that matter, as a nation) we can learn to approach the concept of risk rationally.


Halocarbons from young sea ice and frost flowers

An interesting paper came out recently from a group at the University of Gothenburg in Sweden.  Working at Svalbard, a common Arctic study site for European researchers, they measured the flux of organic compounds called halocarbons (or organohalides) from young sea ice and frost flowers.  Halocarbons are a class of chemicals defined by one or more halogen atoms (fluorine, chlorine, bromine, or iodine) bound to a chain of carbon and hydrogen atoms.  The simplest halocarbons are the methylhalides, which contain only a single carbon.

Halogens are important reactive chemicals in the atmosphere, playing a role in ozone depletion (at ground level, which is good, and in the stratosphere, which is bad) and cloud condensation.  Halocarbons, especially the lower molecular weight ones like the methylhalides, are a source of halogens to the atmosphere.  When bound up in an organic molecule the halogen atoms are less reactive, and less prone to reacting with large, heavy molecules than when free.  This enables them to leave their site of origin (sea ice, seawater, leaf litter, a tree…).  Once in the atmosphere the light, unreactive halocarbons can be split apart by sunlight-driven reactions, freeing the halogens as reactive ions.

The concentration of common halocarbons in the atmosphere. Note the strong seasonal cycle for methylchloride, an indication that it is derived from photosynthetic organisms. Image from http://www.esrl.noaa.gov/gmd/hats.

Lots of people are interested in what produces the halocarbons that end up in the atmosphere.  There are many sources, including anthropogenic ones, but a major source is marine phytoplankton and algae.  Like terrestrial plants, these organisms produce oxygen during photosynthesis.  Oxygen, and the energy required to produce it, is dangerous stuff to have in a living cell.  Like gasoline in a car it can do a lot of damage if the stored energy gets released in the wrong way.  The damage oxygen does to cells is called oxidative stress, and cells stockpile antioxidants as a defense against it.  There are also certain enzymes that have antioxidant functions, including a class of enzymes called haloperoxidases that act on hydrogen peroxide (a highly oxidizing byproduct of photosynthesis).  As the name implies, halogen atoms play a role in the reaction catalyzed by haloperoxidases.  The product of the reaction is a halogen that is short an electron, making it particularly reactive.  It can react further with another molecule of hydrogen peroxide regenerating the original halogen ion.  Alternatively it can receive a carbon atom from another enzyme, making it a methylhalide.

The general reaction of haloperoxidase and hydrogen peroxide (Butler and Walker, 1993).

What does all this have to do with frost flowers and young sea ice?  Phytoplankton in the water column are preferentially entrained in young sea ice as the ice forms.  There, close to the light, they continue photosynthesis.  Halocarbons produced by their haloperoxidases are released into the brine network in sea ice rather than into seawater.  As continued ice growth pushes this brine to the surface (some of which ends up in frost flowers) the halocarbons may end up in the atmosphere much more quickly than those produced by phytoplankton in the water column.  The data seem to support this: halocarbon concentrations in young sea ice peak near the surface several days after ice formation (perhaps when the entrained phytoplankton become adapted to their new home and resume photosynthesis).  After several days however, halocarbon concentrations at the ice surface are greatly diminished, possibly due to the loss of connectivity in the brine veins as the ice temperature decreases.

Bromoform, the most abundant halocarbon measured in the study, peaks close to the ice surface at around day 12 (Granfors et al., 2013).

There is also a possible bacterial source of halocarbons in young sea ice and frost flowers, tying this work to our recent publication in EMIR (Bowman and Deming, 2013).  There we reported the presence of an unusual community of bacteria that we hypothesize are symbionts of phytoplankton trapped in the ice.  Many of these bacteria belong to taxonomic groups known to produce halocarbons using haloperoxidases similar to those contained in phytoplankton.  They do this to alleviate the oxidative stress of their phytoplankton partners, which enables the latter to undergo increased levels of photosynthesis (which ultimately benefits the bacteria).

We have not observed many phytoplankton within frost flowers or at the ice surface (conditions there are probably too hostile for them), but there is an abundance of hydrogen peroxide and other oxidizing compounds produced by sunlight at the ice surface.  Perhaps the bacteria in frost flowers, for the short time they are able to survive, are engaged in their normal task of eliminating these harmful compounds and producing halocarbons.  Unfortunately we have not yet undertaken measurements of halocarbons and other biogenic compounds in parallel with analyses of the frost flower microbial community.  Until we do we can only speculate!


Compositional vectors revisited

A couple of articles back I posted a down-and-dirty method for calculating compositional vectors in a whole genome.  I’m using compositional vectors in one of my projects so I reworked the code to make the analysis more rigorous, implementing the suggestion to use Python’s itertools module, shifting the alphabet to amino acid space, limiting the composition to products of open reading frames, and applying the correction from Hao and Qi, 2004.  The correction relies on the calculation of k – 1 and k – 2 length compositional vectors, which are the k1 and k2 sections in the following block.  The k section is the desired k-length vector, and k0 is the normalized vector.
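Before wading into the full script, the correction itself is compact enough to state on its own: the expected count of a k-length word is estimated from the counts of its two overlapping k – 1 words and its central k – 2 word, scaled by a factor that depends on the total sequence length.  As a stand-alone function (my paraphrase; the variable names are mine, not Hao and Qi’s):

```python
## expected count of a k-length word under the Hao and Qi (2004) correction;
## n_left and n_right are counts of the two overlapping (k-1)-length words,
## n_mid is the count of the central (k-2)-length word, L the total length
def expected_count(n_left, n_right, n_mid, L, k):
    length_factor = ((L - k + 1.0) * (L - k + 3.0)) / (L - k + 2.0) ** 2
    return (n_left * n_right / float(n_mid)) * length_factor

## the normalized vector entry is the relative deviation of the observed
## count from this expectation
def normalized(observed, expected):
    return (observed - expected) / expected
```

The k0 section of the code implements exactly this arithmetic inline.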

#### load some modules we will need ####

import os
import subprocess
import re
import gzip
import itertools
import cPickle
from Bio import SeqIO
from joblib import Parallel, delayed

## the alphabet: the 20 standard amino acids
pro = 'ACDEFGHIKLMNPQRSTVWY'

#### generate all possible kmers ####

k = 5
k1 = k - 1
k2 = k - 2

## k      

pros = list(itertools.repeat(pro, k))
bins = list(itertools.product(*pros))
new_bins = []    
for b in bins:
    b = ''.join(b)
    new_bins.append(b)
bins = new_bins
nmers = open(str(k)+'mers_pro.set', 'wb')                   
cPickle.dump(bins, nmers)
nmers.close()

## k1

pros = list(itertools.repeat(pro, k1))
k1_bins = list(itertools.product(*pros))
k1_new_bins = []    
for b in k1_bins:
    b = ''.join(b)
    k1_new_bins.append(b)
k1_bins = k1_new_bins
k1_nmers = open(str(k1)+'mers_pro.set', 'wb')                   
cPickle.dump(k1_bins, k1_nmers)
k1_nmers.close()

## k2

pros = list(itertools.repeat(pro, k2))
k2_bins = list(itertools.product(*pros))
k2_new_bins = []    
for b in k2_bins:
    b = ''.join(b)
    k2_new_bins.append(b)
k2_bins = k2_new_bins
k2_nmers = open(str(k2)+'mers_pro.set', 'wb')                   
cPickle.dump(k2_bins, k2_nmers)
k2_nmers.close()

Once all possible words have been identified for k, k1, and k2, I count their occurrence in the proteome and then normalize the counts.  The code doesn’t show where “name” comes from; in my case I’ve already looped over the files in the directory and generated a list of file prefixes for the proteomes I want to analyze (code not shown).  I append ‘_pro.fasta’ to each prefix to get back to the actual file name.

#### find normalized kmers in proteome ####

import gzip
from Bio import SeqIO

def calc_vector(name, bins, k1_bins, k2_bins):
    k1_found_bins = {}
    k1_used_bins = set()
    k2_found_bins = {}
    k2_used_bins = set()
    found_bins = {}
    used_bins = set()
    bigL = 0

    seqs = name+'_pro.fasta'
    for record in SeqIO.parse(seqs, 'fasta'):
        query = str(record.seq)
        bigL = bigL + len(query)

        ## k1 and k2; the ranges stop short of the end of the
        ## sequence so no truncated words are counted

        for i in range(0, len(query) - k1 + 1):
            kmer = query[i:i+k1]
            print name, 'k1', i
            if kmer not in k1_used_bins:
                k1_found_bins[kmer] = 1
                k1_used_bins.add(kmer)
            else:
                k1_found_bins[kmer] = k1_found_bins[kmer] + 1

        for i in range(0, len(query) - k2 + 1):
            kmer = query[i:i+k2]
            print name, 'k2', i
            if kmer not in k2_used_bins:
                k2_found_bins[kmer] = 1
                k2_used_bins.add(kmer)
            else:
                k2_found_bins[kmer] = k2_found_bins[kmer] + 1

        ## k

        for i in range(0, len(query) - k + 1):
            kmer = query[i:i+k]
            print name, 'k', i
            if kmer not in used_bins:
                found_bins[kmer] = 1
                used_bins.add(kmer)
            else:
                found_bins[kmer] = found_bins[kmer] + 1

    ## k0: the expected count for each kmer, from the counts of its
    ## two overlapping (k-1)-mers and its central (k-2)-mer; bigL is
    ## the total proteome length, not the length of the last record

    norm_bins = {}
    for kmer in found_bins.keys():
        kmer_1 = kmer[0:-1]
        kmer_2 = kmer[1:]
        kmer_3 = kmer[1:-1]
        kmer_0 = ((k1_found_bins[kmer_1] * k1_found_bins[kmer_2])
                  / float(k2_found_bins[kmer_3])) * (((bigL - k + 1)
                  * (bigL - k + 3)) / float((bigL - k + 2) ** 2))
        kmer_norm = (found_bins[kmer] - kmer_0) / kmer_0
        norm_bins[kmer] = kmer_norm
        print name, kmer, kmer_norm

    ## fill out dictionary with 0 values for unrepresented kmers

    for nmer in bins:
        if nmer not in used_bins:
            norm_bins[nmer] = 0

    with gzip.open(name+'_'+str(k)+'mer_bins.txt.gz', 'wb') as bins_out:
        for each in sorted(norm_bins.keys()):
            print >> bins_out, each+'\t'+str(norm_bins[each])
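To make the arithmetic in the k0 block easier to follow, here it is pulled out on its own.  This is the same expected-count formula used above; the counts and the proteome length in the example are purely made up for illustration.

```python
def expected_count(n_prefix, n_suffix, n_middle, bigL, k):
    ## expected count of a kmer from the observed counts of its two
    ## overlapping (k-1)-mers and its central (k-2)-mer, with a
    ## correction for the number of windows of each length in a
    ## proteome of bigL residues
    return ((n_prefix * n_suffix) / float(n_middle)) * \
           (((bigL - k + 1) * (bigL - k + 3)) / float((bigL - k + 2) ** 2))

def normalized_count(n_observed, n_prefix, n_suffix, n_middle, bigL, k):
    ## deviation of the observed count from the expected count
    k0 = expected_count(n_prefix, n_suffix, n_middle, bigL, k)
    return (n_observed - k0) / k0

## an invented example: a kmer observed 12 times, whose prefix, suffix,
## and middle words were observed 30, 25, and 60 times in a 10,000
## residue proteome; the result is about -0.04, i.e. the kmer occurs
## slightly less often than neutral expectation
norm = normalized_count(12, 30, 25, 60, 10000, 5)
```

A positive value means a word is over-represented relative to what its sub-words predict; a negative value means it is under-represented.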

It takes a little bit of time to calculate the normalized vector for each proteome you wish to analyze.  One way to greatly speed things up is to analyze the proteomes in parallel.  I’m using the Parallel function from the joblib module for this.  Getting it to work right wasn’t straightforward, and the rather kludgy method of printing each compositional vector to file in the above function was my workaround for not being able to create a dictionary of vectors in a parallelized loop.  In the loop below I’m generating a compositional vector for each proteome specified in names, starting a new process for each proteome up to the number of cpus available on the system (n_jobs = -1).

#### run the function in parallel on input files ####

from joblib import Parallel, delayed

Parallel(n_jobs = -1, verbose = 5)(delayed(calc_vector)
    (name, bins, k1_bins, k2_bins) for name in names)

The final piece of the puzzle is to bring together the individual vectors into a single matrix that can be converted to a distance matrix (or whatever you wish to do with it).  There are a number of ways to do this.  I chose to generate a dictionary of all vectors, with the “name” as key.  The bins are in order (or should be!) in the list that makes up each dictionary value.  It can take a long time to calculate the matrix, so I save the dictionary as a pickle; that way I can access it anytime in the future without having to re-calculate anything.  Note that the matrix is transposed; if you write it to a text file be sure to change the orientation.  R and Matlab (I believe) will not read a matrix with 3.2 million columns (that many rows, and many more, are okay)!

#### bring the vectors together and create some permanent ####
#### output                                               ####

import os
import re
import gzip
import cPickle

final_bins = {}
for f in os.listdir('.'):
    if f.endswith('bins.txt.gz'):
        name = re.split('_'+str(k)+'mer_bins', f)
        name = name[0]
        print name
        with gzip.open(f, 'rb') as bin_file:
            temp = []
            for line in bin_file:
                line = line.rstrip('\n')
                line = line.split('\t')
                bin_id = line[0]
                bin_num = float(line[1]) ## convert from string for later math
                temp.append(bin_num)
            final_bins[name] = temp

cPickle.dump(final_bins, open(str(k)+'mer_normalized_phylogeny_vector_output.p', 'wb'))
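From the final_bins dictionary a Euclidean distance matrix is only a few lines away.  A minimal sketch in plain Python; the proteome names and three-element vectors below are invented stand-ins for the real 3.2 million element vectors.

```python
import math

## invented vectors; in practice these are the values of final_bins,
## with the bins in the same sorted order for every proteome
final_bins = {'proteome_a': [0.1, -0.2, 0.0],
              'proteome_b': [0.1, -0.1, 0.3],
              'proteome_c': [-0.5, 0.4, 0.2]}

def euclidean(v1, v2):
    ## straight-line distance between two compositional vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

names = sorted(final_bins.keys())
dist = {}
for n1 in names:
    for n2 in names:
        dist[(n1, n2)] = euclidean(final_bins[n1], final_bins[n2])
```

The dist dictionary can then be written out as a tab-delimited matrix for clustering in R or anywhere else.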
Posted in Research | Leave a comment

Top 10 scripting tricks for basic bioinformatics

In the process of preparing some data for publication I needed to revisit some scripts I wrote several months ago, clean them up, and re-run them.  I was pretty amazed at how bad they were, and ended up spending a couple of days rewriting them (necessary?  No… but it felt good).  Many of the changes between the old versions and the new versions came down to the implementation of a relatively small number of tools.  Here are the top ten elements present in scripts I write now that have resulted in huge performance improvements over scripts I wrote, say, a year ago.  I can only hope the pace of improvement continues.  The elements are a mix of Python and shell commands or data structures.  If you’re an experienced hand at writing scripts these are pretty basic tips, but if you’re just starting out you might find them useful.  If you’ve got a life-saving trick of your own leave a comment.

Number 10: the tab delimited format

I’ll be honest.  The xml format scares me, but not as badly as the newick format and some of the other more obscure “standard” data formats that crop up in bioinformatics.  Perhaps with more experience I’ll get over this, but for now nothing makes me happier than the tab-delimited format.  Easy to read, easy to parse.  Even some formats developed by people sophisticated enough to come up with something more complex are tab-delimited (the SAM format, for example).  The three tools I use the most on a day-to-day basis are bash, Python, and R.  With everything tab-delimited they all play nice together.  I try to keep as much data in this format as possible.

Number 9: head and tail

Not a scripting technique per se, and one of the first shell commands most people learn.  Have a 5 GB tab-delimited file of unknown columns?  Use head -n X to quickly access the first X lines.  Need a small test file for your next script?  Redirect that same head command to a file and you’re all set.  The same applies to tail, except it returns the last X lines.

Number 8: the redirect

Also one of the first shell techniques most people learn.  Redirects allow you to send files to shell commands and capture output.  For example if you want a file of all the sequence names in a very large fasta, grep and a redirect will do the trick:

fgrep '>' big.fasta > big.fasta.names.txt

Related to the redirect is the pipe.  You can send the output of one command straight to another with the “|” character.  For example if you need to concatenate thousands of fasta files and then compress them:

cat *.fasta | gzip

This eliminates the large, uncompressed intermediate file.

Number 7: dictionaries

I didn’t really get dictionaries when I took a class on Python as a new graduate student.  I understood what they did, but not how that was particularly useful.  In the last script I wrote I think I implemented four different dictionaries, mapping gi number to ncbi id, gi to taxonomic id, ncbi id to taxonomy, and finally actual sequence information to taxonomy (or something along those lines).  This multi-step mapping would have been extremely complex and inefficient using a different data structure.  One conceptual breakthrough came when I realized dictionary values could be anything – lists, sets, even other dictionaries.  It’s a great way to organize and quickly access a lot of information.
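As a toy illustration of that kind of chained lookup (all of the identifiers and taxonomy strings below are invented, not real NCBI data):

```python
## invented mappings: gi number -> taxid, taxid -> taxonomy
gi_to_taxid = {'12345': '111', '67890': '222'}
taxid_to_taxonomy = {'111': 'Gammaproteobacteria;Alteromonadales',
                     '222': 'Alphaproteobacteria;Rhizobiales'}

## chain the two dictionaries to map gi straight to taxonomy
gi_to_taxonomy = {}
for gi, taxid in gi_to_taxid.items():
    gi_to_taxonomy[gi] = taxid_to_taxonomy[taxid]

## values can be containers too: group gi numbers by taxonomy
taxonomy_to_gis = {}
for gi, taxonomy in gi_to_taxonomy.items():
    taxonomy_to_gis.setdefault(taxonomy, []).append(gi)
```

Each lookup is a constant-time hash, so chaining several dictionaries stays fast even with millions of keys.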

Number 6: parallel

Parallel computing baffles me a little.  I have about a 50 % success rate implementing parallel cluster computing on various projects using MPI, and a 0 % success rate implementing the Python parallel computing modules.  The GNU “parallel” command is a cozy warm blanket in the cold hard world of parallel computing.  If you have multiple cpus on the same motherboard (for example our lab’s MacPro workstation) and a job you can easily split into many small jobs, parallel allows simultaneous execution of as many jobs as you have cores.  It works great on loops in bash shells and also on multiple commands in a bash shell script.  I’ve found the simplest way to execute parallel is to use Python to split my job and create a bash script with one line per job (each line might contain multiple commands).  Redirecting the script into parallel results in parallel execution.  Parallel is smart enough to capture and sort output from all the threads, even if it’s going to the same place.

parallel < my_awesome_script.sh > my_dissertation.txt

Number 5: grep

Another basic shell command.  Grep, fgrep, and zgrep are magic commands that can extract critical lines of data based on pattern matching.  If you aren’t using a nested file format (if you are using tab-delimited or similar) the grep commands can quickly subset a massive file to just those lines you need.  It’s worth looking at all the options available for these commands, which make them quite powerful.

Number 4: “vectorized” commands

I’m not sure if “vectorized commands” is the right term, but I’ve heard that term applied to R, which discourages the use of loops and relies very heavily on commands that iterate intrinsically.  Python tends to rely more heavily on loops, but it has a number of these intrinsic iterators that are much, much faster than loops.  For example to match a string to a list of strings you might be tempted to use a loop:

for s in some_list:
     if my_string == s:
          do something

However the “in” command will perform much faster:

if my_string in some_list:
     do something

Number 3: gzip

Bioinformatics is big files.  You can’t escape from them.  The gigabytes keep growing until they’re terabytes, and then I don’t really want to think about what happens.  And of course you have to move all these files from computer to computer and try to do things with them.  Python (using the gzip module) and R (natively) both play well with gzip.  So does zgrep.  I try to keep everything larger than a couple of megabytes in the gzip format.  This saves a ton of space and of course file transfers are faster.
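Working with gzipped text in Python looks almost exactly like working with plain text.  A minimal sketch (the file name is arbitrary; in current Python the ‘wt’ and ‘rt’ text modes handle the compression transparently):

```python
import gzip

## write tab-delimited lines straight to a compressed file
with gzip.open('example.txt.gz', 'wt') as out:
    out.write('seq1\t100\n')
    out.write('seq2\t250\n')

## read it back line by line; nothing is ever decompressed to disk
with gzip.open('example.txt.gz', 'rt') as handle:
    lines = [line.rstrip('\n') for line in handle]
```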

Number 2: with

With is another Python command that I didn’t really get at first.  Then came the day that I needed to work through a many gigabyte file in Python and couldn’t do it.  The “with open(…) as handle:” statement opens the file and guarantees it gets closed when the block ends; iterating over the handle then reads the file line by line, without holding the whole thing in memory.  Downside, it adds one more indentation level to that block of your code.  Upside, you can add multiple “open as” segments to the with line:

with open('file_1.txt', 'r') as file1, \
open('file_2.txt.gz', 'rb') as file2, \
open('output.txt', 'w') as output:
     for line in file1:
          do lots of stuff

Number 1: the set

The single most useful scripting trick I’ve learned over the last year is the set.  Sets are containers of unique, unordered elements.  Searching across a set is lightning fast compared to searching across a list.  Sets convert to and from ordered lists easily enough if ordering is important and duplicate entries are not.  I almost never search across Python lists anymore.  Sets have allowed me to create some scripts that will run overnight that would have taken weeks or months with lists.
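The speed difference comes from how membership is tested: a list is scanned element by element, while a set is hashed.  A small sketch of the lookups and conversions (timings left out since they vary by machine):

```python
big_list = [str(i) for i in range(100000)]
big_set = set(big_list)

## both tests are True, but the set does not scan 100,000
## items to answer
in_set = '99999' in big_set
in_list = '99999' in big_list

## sets drop duplicates and ordering; sorted() gets an order back
unique = sorted(set(['b', 'a', 'b']))
```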

Happy scripting!


The future of phylogeny

Yesterday I went to an inspiring informal seminar by Bailin Hao from Fudan University in Beijing.  Bailin presented work by his group to establish a new method for determining microbial phylogeny.  Inferring phylogeny is currently a massive headache in the field of microbiology.  Traditionally phylogeny was determined by aligning a marker gene (typically the 16S rRNA gene) from a set of bacteria, and inferring relatedness within the set from differences in these alignments.  These differences are the result of insertions, deletions, and base changes accumulated over time as the gene evolved independently for each analyzed strain.

Although this approach has served the study of microbial evolution and ecology well it has some serious shortcomings.  First, phylogenetic inference from gene alignment is hard.  A lot of work has gone into developing sophisticated statistical models that predict how likely a given mutation is to occur (and thus where gaps should appear in an alignment), but even the best model can’t really capture what’s occurring in nature.  No alignment is perfect, and even imperfect alignments take a lot of time and computer power to calculate.  Second, genomes don’t always evolve in a predictable vertical fashion, with a daughter genome acquiring all the genes of the parent genome and perhaps a beneficial mutation or two.  Single celled organisms often pick up genes they encounter in the environment.  If beneficial, those genes can find a permanent home in the genome.  There is no guarantee that the source organism of a “horizontally transferred” gene is closely related to the recipient, so a phylogenetic inference based on a horizontally transferred gene will be incorrect with respect to the whole genome.  Third, there is the question of what taxonomy means anyway.  If a bacterial genome is (hypothetically) 30 % horizontally acquired genes, and has a radically different phenotype than its closest relatives as a result of these transfers, what is the value of determining traditional taxonomy?

Enter compositional vector based phylogeny.  This method is just one of a new generation of sequence comparison approaches that use kmers to determine similarity.  A kmer is a word (of a length set by the investigator) created from a sliding window moving along the sequence in question.  For example the sequence AGCTCCGGC would have the 4 letter kmers AGCT, GCTC, CTCC, TCCG, CCGG, and CGGC.  The more closely related two sequences are, the more kmers they will have in common.  The kmers in common can be counted without aligning the two sequences, thus eliminating one of the most time and computer intensive parts of sequence comparison.
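The sliding window is only a couple of lines of Python.  A minimal sketch, reproducing the example from the text:

```python
def kmers(seq, k):
    ## one window of width k per start position along the sequence
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def shared_kmers(seq1, seq2, k):
    ## kmers in common between two sequences, no alignment needed
    return set(kmers(seq1, k)) & set(kmers(seq2, k))

## the sequence from the text and its six 4-letter kmers
example = kmers('AGCTCCGGC', 4)
```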

In the compositional vector approach to phylogeny a complete microbial proteome is converted to a vector of all possible amino acid kmers of a specified size.  For a five letter kmer the possible kmers are AAAAA through VVVVV, with 3,200,000 possible words (20 ^ 5).  Tallying the number of occurrences of each possible kmer in the proteome produces a vector.  Converting two or more vectors to a distance matrix provides a measure of phylogenetic distance.

What really struck me was the relative simplicity of this method.  In reality there are some important additional steps, namely the vectors need to be normalized to account for neutral evolution, but any investigator can use this method without sophisticated tools.  Python’s probably not the best language for this, but the concept can be explored with a simple Python script.  The script below will calculate a matrix of compositional vectors for some number of proteomes, WITHOUT the important normalization step.  It’s not fast, but it can calculate the compositional vectors for four proteomes (in the example obtained from UniProt) in a few minutes’ time.

import cPickle
import os

## if you haven't already done so generate all possible 
## five letter words

if '5mers.set' not in os.listdir('.'):
    aa = ['A','R','N','D','C','E','Q','G','H','I',\
    'L','K','M','F','P','S','T','W','Y','V']   
    bins = set()    
    temp = [0,0,0,0,0]

    for residue_1 in aa:
        temp[0] = residue_1
        for residue_2 in aa:
            temp[1] = residue_2
            for residue_3 in aa:
                temp[2] = residue_3
                for residue_4 in aa:
                    temp[3] = residue_4
                    for residue_5 in aa:
                        temp[4] = residue_5
                        temp_string = ''.join(temp)
                        print temp_string
                        bins.add(temp_string)

    fivemers = open('5mers.set', 'wb')                   
    cPickle.dump(bins, fivemers)
    fivemers.close()

else:
    fivemers = open('5mers.set', 'rb')
    bins = cPickle.load(fivemers)
    fivemers.close()

## set up a matrix to hold kmer tally
vec_matrix = []
for b in sorted(bins):
    l = [b]
    vec_matrix.append(l)

## concatenate proteins in multifasta to single string

for seq in ['34H_proteome.fasta', \
'agrobacterium_tumefaciens.fasta', \
'sinorhizobium_medicae.fasta', \
'sinorhizobium_meliloti.fasta']:    
    with open(seq, 'r') as fasta:
        query = ''
        for line in fasta:
            if line.startswith('>') == False:
                line = line.rstrip('\n')
                query = query+line

    ## search proteome for all 5 letter kmers, and tally occurrence

    found_bins = {}
    used_bins = set()   

    for i in range(0,len(query)):
        kmer = query[i:i+5]
        print i
        if kmer not in used_bins:
            found_bins[kmer] = 1
            used_bins.add(kmer)
        else:
            found_bins[kmer] = found_bins[kmer] + 1

    ## fill out dictionary with 0 values for unrepresented kmers

    for fivemer in bins:
        if fivemer not in used_bins:
            found_bins[fivemer] = 0

    for i,r in enumerate(sorted(bins)):
        vec_matrix[i].append(found_bins[r])

## print matrix as tab delimited text file

output = open('phylogeny_vector_output.txt', 'w')        
for row in vec_matrix:
    for i,e in enumerate(row):
        if i+1 != len(row):
            print >> output, str(e)+'\t',
        else:
            print >> output, str(e)   
output.close()

Here’s what the output looks like:

AAAAA	12	90	125	133
AAAAC	0	1	6	3
AAAAD	2	12	21	28
AAAAE	5	26	53	59
AAAAF	3	21	25	30
AAAAG	8	42	62	69
AAAAH	0	6	9	11
AAAAI	8	30	44	46
AAAAK	2	27	20	26
AAAAL	16	50	73	75

The whole matrix is ~50 MB; for anything larger than a few proteomes it would be advisable to compress the output (cool trick – I only recently learned that R will read in a .gz file with no additional prompting, this has saved me a ton of storage space of late).  To look at the relatedness of the strains you could use R or any other data analysis platform.  For this case a dendrogram derived from a distance matrix of the original data should illustrate compositional relatedness.  Again, I stress that this is an overly-simplistic interpretation of the actual method.  In R…

d <- t(as.matrix(read.table('phylogeny_vector_output.txt',
                            row.names = 1)))
d_dist <- dist(d, "euclidean")
d_clust <- hclust(d_dist)
plot(as.dendrogram(d_clust))

Voilà, a dendrogram of relatedness!  V2-5 correspond to the order of the original composition matrix, or Colwellia psychrerythraea, Agrobacterium tumefaciens, Sinorhizobium medicae, and Sinorhizobium meliloti.  As we would expect Colwellia, a Gammaproteobacteria, clusters separately from the other strains, all Alphaproteobacteria.  The two Sinorhizobium strains cluster closest.  For more about this approach, and details on the full method, check out Li, Xu, and Hao, 2010 in the Journal of Biotechnology.  The Hao lab has an excellent website in beta where all available genomes are represented on an interactive tree: http://tlife.fudan.edu.cn/cvtree/.

Dendrogram of Euclidean distance based on complete proteome compositional vectors for Colwellia psychrerythraea (V2), Agrobacterium tumefaciens (V3), Sinorhizobium medicae (V4), and Sinorhizobium meliloti (V5).


Some things should be easy, and they are…

**UPDATE**
Got it.  You have to set the BLASTDB variable for this to work and place the taxonomy files in this same location.  I just did it in my .bash_profile:

BLASTDB="/path/to/databases"
export BLASTDB

**UPDATE**
The original motivation for this was to obtain taxonomy data for blast search results.  Sometime in the last few months this feature actually slipped into the BLAST+ package.  I can’t actually get the feature to work, but perhaps you can!  It requires the installation of two special taxonomy files found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz.  Typically cryptic instructions can be found on the command line blast manual page at http://www.ncbi.nlm.nih.gov/books/NBK1763/, along with the necessary flags to get the taxonomy to appear (ha!) in table format output.  I suspect that my inability to get this to work stems from my dislike for mucking with the blast profile file.  Instead of setting a default database location I use the complete path in my calls.  Regardless of where I locate the taxonomy database blast can’t seem to find it 🙁

****

A couple of months ago I wrote a post on the difficulty of parsing the gi_taxid_nucl.dmp.gz file that maps sequence gi numbers to their NCBI taxonomy number (taxid).  With a taxid it is possible to obtain the actual taxonomy of (some) sequences in Genbank.  I proposed a couple of elaborate solutions, some of which were improved upon in the comments section by other contributors.  One particularly good suggestion was to use primary keys in a SQL database for fast look-ups.

In my latest analysis I have ~ 50,000 gi numbers to map to taxids and had to revisit this issue again.  After some tinkering I realized I’d missed a very obvious and reasonably efficient way to extract the relevant lines from the gi_taxid_nucl/prot.dmp.gz file.  If you have only a few gi numbers to map stick with the SQL or grep solutions.  If you have many give this method a try.  Making use of Python’s with statement it is possible to very quickly search each line of the .dmp.gz against a known collection of gi numbers contained in a set.  I only recently discovered Python sets, and they have already saved me numerous times.  Searching against items in a set is orders of magnitude faster than searching against items in a list.

# assuming that gis is a set of gi numbers...
import gzip

gi_taxid_output = open('small_gi_taxid.txt', 'w')
with gzip.open('gi_taxid_prot.dmp.gz', 'r') as dmp:
    found = 0
    for line in dmp:
        line = line.rstrip('\n') ## strip the newline so print doesn't double it
        line_split = line.split('\t')
        print line_split[0], found
        if line_split[0] in gis:
            found = found + 1
            print >> gi_taxid_output, line
gi_taxid_output.close()

Using this method a search for 50,000 gi numbers takes ~ 1 hour for gi_taxid_nucl.dmp.gz and about 30 minutes for gi_taxid_prot.dmp.gz.  The grep method, even utilizing the gnu parallel command for an 8 fold speed increase (on an 8 core machine), would have taken at least overnight and produced a messy gi_taxid file with many erroneous hits.


A particularly bad idea

I haven’t written many posts on this blog that are political in nature; in fact I think the only political topic that’s come up is the need for new investment in the US ice breaker fleet.  A particularly troubling bill, titled the “Quality in Research Act”, is under development in Congress however, and if passed it will do anything but guarantee high quality in our nation’s research.  As a working scientist it is difficult not to have some opinions about it.

A number of blogs have given the proposed act a pretty complete treatment, in particular this post on AmericanScience.  Rather than speculate on the motivation of the bill’s author (Lamar Smith (R-TX); naivety?  anti-science philosophy?), and rather than repeat what’s being said in other forums, I’ll just add a couple of points that seem to be missing from the discussion.  According to AmericanScience the bill will include new criteria for National Science Foundation (NSF) funded projects.  To meet the proposed criteria a project:

1.  Is in the interest of the United States to advance the national health, prosperity, or welfare, and to secure the national defense by promoting the progress of science.

2.  Is the finest quality, is ground breaking, and answers questions or solves problems that are of utmost importance to society at large; and

3.  Is not duplicative of other research projects being funded by the foundation or other Federal science agencies.

All three of these criteria have some deep flaws as written.  I’ll address the second and third criteria, which are the more troubling.  But first, to establish some context, what exactly is the National Science Foundation and how much money does the federal government spend on research?  NSF is one of several government or quasi-government entities that provides Federal funds for research.  It is not the largest such entity, the National Institute of Health (NIH) and the research branches of the Department of Defense (DOD) have much larger budgets.  NASA also has a much larger budget, but the amount spent on actual research is a bit less than what NSF invests.  NOAA, USGS, DOE, EPA and other agencies also conduct research, but this research tends to be mission-oriented and primarily conducted in-house.

The Obama administration’s request for NSF for 2014 amounts to $7.6 billion, a pretty small fraction of the federal budget.  By comparison the NIH request for 2014 is $31.33 billion.  I don’t begrudge NIH and NIH funded investigators a penny of that amount, but the difference highlights a flaw in how the Federal government thinks about science.  The mission of NIH is fundamentally different than that of NSF; the majority of NIH funds goes to applied research – concepts that had a genesis in basic research but that have matured to the point where they can begin to solve people’s problems.  NSF is the primary source of funding for that genesis – it is the single largest source of Federal funding for basic research.

This gets right to the heart of what’s wrong with the second point in the proposed criteria.  It suggests that the authors hope to eliminate the concept of basic research, by requiring projects to solve problems already identified as having the utmost importance to society.  I posit that many of the currently identified problems (and/or currently employed solutions) faced by society would not have been identified under this constraint.  A strength of the current process of peer review within NSF (and the other federal research agencies) is that one expert who sees a possible, unsolved piece of a potential problem can decide to pursue it further, if they can convince a jury of their peers and a project manager (typically a senior scientist on loan to NSF from academia) that the concept has merit.  This is no easy task; something like 10 % of all grant proposals to NSF actually get funded, but a well-made case stands a chance.

In the worst-case and rare scenario where NSF allocates funds to a duplicative or flawed project (a typical NSF project might be $300-400,000 over a three year period), these public funds are still not going to waste.  Unlike in the private sector there is no accumulation of wealth in academia.  Federal money flowing into academia flows right back out (typically to the private sector, but also right back to Federal coffers).  Of $400,000 allocated to a university professor for a 3 year grant, 50 % (varying by institution) goes to the host university as overhead.  With these funds the university supports capital projects (providing jobs), hires staff (more jobs), and improves the educational experience of its students (so they can get jobs).  Since the funded professor is unlikely to have time to actually conduct the research, they will hire a graduate student (a job, at less than the annual earnings of a Starbucks barista) and probably employ a couple of undergraduates part time.  The graduate student will need materials and equipment to carry out the research, which funnels a portion of the NSF investment straight to the private sector.  If there’s anything left from that original $400,000 the professor and graduate students might present their findings at a conference (more private sector – US airlines only, by existing rules!), and the professor might actually pay themselves some salary (they’ve been busy teaching, advising, conducting outreach, and otherwise supporting their institution while the graduate student carries out the project – but they are typically only paid part time for these activities).

That got a little longer than I originally intended, so I’ll only raise a quick objection to the third point in the proposed criteria.  It makes no sense to enforce a rule against duplicate research for two reasons.  First, the scientific community is quite good at policing itself on this account.  The reviewers on an NSF proposal review panel are working scientists.  Their own grants are competing for the same limited pool of funds.  Does Lamar Smith really think that they aren’t going to be harsh on a proposed project that will unnecessarily duplicate work already being funded?  Projects do need to overlap, and problems do need to be probed from different angles by different research groups.  It is probably not clear to someone outside a given field, however, when two projects are complementary and when they simply duplicate work.  So why not leave this assessment to the experts on a review panel?

Clearly there is a lot that the scientific community can do better, in terms of how it conducts research and how it holds its members accountable.  There is a lot of momentum in the community to improve these things.  Despite these shortcomings science has done a pretty good job, fostering a system of (reasonably) open exchange, and a system for evidence-based progression that is unparalleled elsewhere in society.  I can’t resist the barb that it’s a bit presumptuous for congress to question this system, at least before they develop one that works equally well…
