Dispatch from AAAS-CASE

Salaries for University of Washington professors. There is some weirdness at the low end, driven by mid-year promotions and positions with inexplicably low salaries. Positions at the high end are primarily within the medical school. Excluding positions paying < $30,000 per year, the mean is $114,000 per year.

Size of single investigator awards granted by the Antarctic Organisms and Ecosystems program, within the Division of Polar Programs, Directorate for Geosciences.  The mean award amount is a bit shy of $400,000.

As mentioned in my previous post I spent the last three days at the AAAS-CASE workshop in Washington, DC.  The purpose of the workshop was to introduce junior scientists to the legislative process and the ins and outs of science funding at the federal level.  After two days of intense introduction, including a fascinating lecture from congressional expert Judy Schneider, we spent a day on Capitol Hill meeting with staff members of our congressional delegations.  I made the rounds of the Washington delegation with Kelly Fleming, a graduate student in chemical engineering at the UW.

The Washington congressional delegation is strongly pro-R&D.  The University of Washington brings a lot of federal money to the state, as do the Pacific Northwest National Laboratory and NOAA’s Pacific Marine Environmental Laboratory.  Additionally, Boeing, Microsoft, and other large commercial interests in the state benefit heavily from federal R&D investment and a highly educated labor pool, and presumably lobby on behalf of sustained (or improved) funding.

Despite this pro-science stance I was initially surprised at how little congressional staffers understand about how funding works at the institutional level.  In retrospect this isn’t surprising at all.  After six years in graduate school I’m still not 100 % clear on how science funding works at the UW (see my previous post on this issue here).  It quickly became clear that what we could best provide is some perspective on how increased grant funding directly translates into more jobs and an improved quality of life at the institutions that receive federal funds.

Perhaps contrary to popular opinion, academic salaries are not high.  For public universities salaries are a matter of public record; the salaries of UW professors can be found here.  It’s a bit difficult to pull a mean out, because certain appointments receive no salary, and there are duplicate entries for faculty members who were promoted in a given year.  I would guesstimate the mean at $114,000 per year – and this is dragged upwards a bit by some outlier salaries in well-paying departments (engineering, medicine, etc.).  On the surface the mean seems like a healthy salary.  Although it’s much lower than in other fields with a comparable amount of graduate education (e.g. law, medicine), and much lower than what industry pays PhD researchers, this is offset by certain advantages (namely academic freedom and prestige).  If we dig a little deeper however, we see that the reality is quite different.

Most faculty appointments are only funded at 50 % of the stated salary (taking the mean, a paltry $57,000 per year).  That’s not a bad salary when you’re 30 – a bit worse if you’re trying to raise a family – but for a full professor at the end of a productive career it’s an undervaluation.  Why the limit of 50 %?  This guaranteed pay is for teaching classes and performing other service directly to the university.  The remainder of a faculty member’s salary is supposed to be raised from grants.  Salaries aren’t a great use of grant funding, but the system might be considered rational if grant funding were adequate.  To determine if that’s the case let’s pull a typical grant apart.

In the course of recently preparing a grant proposal of my own I determined the mean size of single investigator awards for the NSF Antarctic Organisms and Ecosystems program, within the Division of Polar Programs, to be ~$400,000 over three years.  Here’s how a typical award from that program might break down; costs are very approximate but loosely based on UW rates:

Institutional Overhead at 56 % (i.e. indirect cost recovery, varies): $224,000
Equipment/analytical services (i.e. doing the actual science): $50,000
Consumables (the day to day lab stuff): $20,000
Graduate student operating costs (i.e. tuition for 3 years): $45,000
Graduate student stipend and benefits: $90,000
Partial salary for laboratory technician (someone’s got to teach the grad student): $30,000
Travel costs (assuming this isn’t fieldwork based): $10,000
Remaining for faculty salary: $-69,000
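
For anyone who wants to check the arithmetic, here it is as a quick Python sketch; the figures are nothing more than the rough line items above:

# Hypothetical three-year budget from the post; all figures approximate.
award = 400000

costs = {'overhead (56% of award)': 224000,
         'equipment/analytical services': 50000,
         'consumables': 20000,
         'graduate student tuition, 3 yr': 45000,
         'graduate student stipend and benefits': 90000,
         'partial technician salary': 30000,
         'travel': 10000}

print 'remaining for faculty salary: $%d' % (award - sum(costs.values()))
## remaining for faculty salary: $-69000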

And therein lies the problem.  Faculty work around the financial limits as best they can by not hiring laboratory technicians, cutting back on analysis, and in some cases just dealing with less than 100 % pay (for far more than 40 hours of work a week).  Education, academic service, and research all suffer as a result.  So how to fix this problem?  The largest chunk of money that comes out of a grant is overhead, but how would the UW and other institutions function without the income derived from indirect cost recovery?  Vastly more funding to NSF, NIH, NASA, and the other granting agencies would help (and is a laudable goal anyway), but this isn’t likely to happen anytime soon.

I think the solution might lie in pursuing the more modest goal of retooling the way funds are allocated to investigators, while slowly pushing for more overall science funding.  Taking our previous example, by increasing the size of the award to $469,000 we at least break even (keeping in mind that the hypothetical project was pretty cheap, requiring no expensive field work, equipment, or analysis).  Adding three months of support for our beleaguered faculty member each year – roughly $86,000 over the three years at the mean salary – brings the total to approximately $555,000.  If we assume each faculty member has two active grants at a time (lucky them!) the math works okay; however, two issues remain.

1) If every grant were ~40 % larger, we would need ~40 % more funds to support the same number of awards.

2)  Optimists generally say that 10 % of submitted grants get funded.  The number can be much lower for some divisions.

I don’t think that pushing to get ~40 % more funds to the grant-providing agencies is unrealistic (it’s a minuscule amount of the federal budget), but the political will to do this isn’t even on the horizon.  In the meantime making sure that grants are distributed equitably will help, though this goes against the grain of the brutally Darwinian academic culture, where what matters is that an idea and its implementation be the best.  I think there is a balance to be struck between funding only the very best proposals and funding work that is very good but also supports a promising researcher and their research team.  Encouraging adequate institutional support (which for the UW really means adequate state support) is also important.  Reduced overhead and alternate funding sources for graduate students and technical staff would really help stretch federal grant dollars.

Number two is really the bigger issue.  I’m not familiar with NIH grant applications, but NSF grants require, in addition to a great idea, weeks of preparation.  I’m a reasonably fast writer and reader, but based on my own experiences I’d estimate that it takes about 120 hours of work to put together a good proposal and all the necessary supporting documents.  That’s assuming that you’ve already got a reasonably well-baked idea in your head and some preliminary data to show that the idea is well founded.

In our scenario a faculty member needs to fund two of these things every three years, which means that they have to submit about 20 (of course there will be some resubmissions in this total).  That’s 2,400 hours of work, or 800 hours each year – 20 work weeks at 40 hours per week, if we lived in a sane world where that was the norm.  Given the requirements for teaching, advising, service, and oh yeah, doing the actual research, I don’t think this is realistic.  One solution is to move grants to a five-year cycle, so that a faculty member needs to fund two grants every five years.  Yes, that makes each grant a bit larger, but it also reduces the number of active awards and has the additional advantage of bringing the grant cycle into (approximate) alignment with the length of time it takes a graduate student to get a PhD.
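
If you want to fiddle with the assumptions, the workload arithmetic fits in a few lines of Python; the success rate, hours per proposal, and cycle lengths here are just the post’s estimates, nothing more:

# Proposal-writing workload under the post's assumptions.
hours_per_proposal = 120
success_rate = 0.10
grants_needed = 2    # grants to fund per cycle

for cycle_years in [3.0, 5.0]:
    proposals = grants_needed / success_rate          # ~20 submissions
    hours_per_year = proposals * hours_per_proposal / cycle_years
    print '%g year cycle: %.0f hours/year (%.0f work weeks)' % \
        (cycle_years, hours_per_year, hours_per_year / 40)

## 3 year cycle: 800 hours/year (20 work weeks)
## 5 year cycle: 480 hours/year (12 work weeks)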

I had a great conversation with my colleagues at the Blue Marble Space Institute of Science about this issue this morning during the monthly “Beer with BMSIS” event (to be fair it was the afternoon in most of the country, and we only discuss a beer of choice; we don’t typically drink it).  These events are recorded, so you can listen here (and of course tune in the first Thursday of each month; you don’t need to be a member of BMSIS to join us – but it does help if you like beer).


Posted in Uncategorized | Leave a comment

AAAS-CASE

I’m currently in Washington, DC for the pilot AAAS-CASE (Catalyzing Advocacy in Science and Engineering) workshop.  I wanted to share a great short video that was sent to the attendees before the workshop.

http://www.innovationdeficit.org/

The video describes the “innovation deficit”: the gap between current federal investment in science and engineering and what is needed to sustain new advances in energy, medicine, agriculture, technology, and our fundamental understanding of the world around us.  The problem isn’t difficult to understand: the entire NSF research budget is something like $7 billion (roughly 35 F-35 joint strike fighters at the going rate – every single one of which flies or fails based on past federal expenditures on both basic and applied research).

How do we address the innovation deficit?  By building stronger connections between scientists and policy makers – two groups naturally ill at ease with one another.  The idea behind AAAS-CASE is to start a conversation among junior scientists about how they can guide better policy, and to equip them with the tools to engage with policy makers.  We’ll see how it goes…

Posted in Uncategorized | Leave a comment

Converting pfams to cogs

Anyone with a sequence to classify faces a bewildering array of database choices.  You’ve got your classic NCBI quiver: nt, nr, refseq, cdd, etc.  Then you’ve got uniprot, seed, pfam, and numerous others.  Some of these have obvious advantages or disadvantages.  The NCBI nr database, for example, is comprehensive but cumbersome.  Choosing between others can at times feel arbitrary.

One database that I use a lot is pfam.  I like it because, in addition to being directly supported by the excellent protein classifier HMMER3.0, it is well curated and has helpful Wikipedia-like articles for some of the protein families.  Classification is based on the presence of conserved domains, which is also a little easier to wrap your head around than the implications of sequence similarity alone.  The downside to pfam is that the presence of a conserved domain doesn’t always clue you in to the function of an unknown protein sequence.  There are also many thousands of protein domains.  Sometimes you can guess at the function from the name; Aldedh, for example, is the aldehyde dehydrogenase family.  Other names are more opaque.  Abhydrolase_5 is a family containing the alpha/beta hydrolase fold domain, but unless you’re a biochemist the implications of that are probably a mystery (they were to me; fortunately there are the wiki articles).  What’s needed is a broader level of classification for these protein families.

NCBI used to maintain a really great database called COG (Clusters of Orthologous Groups).  Proteins were grouped into COGs, which were given descriptive names like Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase.  Even better, the COGs were assigned letters based on their perceived physiological function.  Some might argue that this is too coarse a classification, and they might be right, but it is still helpful when evaluating the ecological relevance of a protein.  Here are the COG codes, taken from the fun.txt file at the COG ftp site:

INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones

METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown
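
If you want these codes handy in a script, they’re easy to load into a dictionary; this sketch assumes a local copy of fun.txt with the layout shown above (category headers in caps, one bracketed code per line):

cog_codes = {}

with open('fun.txt') as fun:
    for line in fun:
        line = line.strip()
        if line.startswith('['):
            ## the letter sits between the brackets, the description follows
            cog_codes[line[1]] = line[3:].strip()

print cog_codes['J']
## Translation, ribosomal structure and biogenesis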

I decided to try to map the pfam database to the COG database so that I could have my cake and eat it too.  To do that I ran blastp on the Pfam-A.fasta file against a protein blast database created from cog0303.fa, and used the following script to parse the tab-delimited blast output and various other files to create the mapping.  This was a reasonably large blast job and I ran into a computer issue partway through, so the job didn’t complete.  When I have the chance to re-run it I’ll post the completed pfam to cog table here.  In the meantime here’s the script for anyone who would like to create their own pfam to cog map.  It is assumed that all database files are in the working directory.

## first step was to blast pfam-A against cog0303.fa
## blastp -db COG/cog0303 -query Pfam-A.fasta -out Pfam-A_cog.txt -outfmt 6 -num_threads 8 -evalue 1e-5
## if you want to conduct a new mapping, for example from a new blast file, remove the old pickle first

import os
import gzip
import cPickle

cog_cats = {}    ## cog -> functional category letter(s)
cogs_seqs = {}   ## cog sequence id -> cog
cog_names = {}   ## cog -> descriptive name
pfam_seqs = {}   ## pfam sequence id -> pfam
pfam_cog = {}    ## pfam -> (cogs, categories, names)

if 'pfam_cog_dict.p' not in os.listdir('.'):

    ## map cog name to cog category
    print 'mapping cog name to cog category'
    with open('cogs.csv', 'r') as cog_file:
        for line in cog_file:
            line = line.rstrip().split(',')
            cog_cats[line[0]] = line[1]
            cog_names[line[0]] = line[2]

    ## map cog sequence to cog name
    print 'mapping cog sequence to cog name'
    with open('domdat.csv', 'r') as cogs_seqs_file:
        for line in cogs_seqs_file:
            line = line.rstrip().split(',')
            cogs_seqs[line[0]] = line[6]

    ## map pfam sequence to pfam
    print 'mapping pfam sequence to pfam'
    with open('Pfam-A.fasta', 'r') as pfam_seqs_file:
        for line in pfam_seqs_file:
            if line.startswith('>'):
                line = line.strip('>').rstrip().split()
                pfam_seqs[line[0]] = line[2]

    ## map each blast hit to pfam, cog, cog category, and cog name
    print 'mapping blast'
    with gzip.open('Pfam-A_cog.txt.gz', 'rb') as blast:
        for line in blast:
            line = line.rstrip().split()
            pfam_seq = line[0]
            cog_seq = line[1]
            pfam = pfam_seqs[pfam_seq]
            cog = cogs_seqs[cog_seq]
            try:
                cog_name = cog_names[cog]
            except KeyError:
                cog_name = 'NULL'
            try:
                cog_cat = cog_cats[cog]
            except KeyError:
                cog_cat = 'NULL'
            try:
                ## pfam already seen: extend the existing sets
                temp_cog, temp_cog_cat, temp_cog_name = pfam_cog[pfam]
            except KeyError:
                ## first hit for this pfam: start new sets
                temp_cog = set()
                temp_cog_cat = set()
                temp_cog_name = set()
            temp_cog.add(cog)
            temp_cog_cat.add(cog_cat)
            temp_cog_name.add(cog_name)
            pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name
            print pfam, cog, cog_cat, cog_name

    cPickle.dump(pfam_cog, open('pfam_cog_dict.p', 'wb'))

else:
    ## on a re-run the existing pickle is used to write the table
    print 'using existing pfam to cog map'
    pfam_cog = cPickle.load(open('pfam_cog_dict.p', 'rb'))

    with open('pfam_cog_cat_map.txt', 'w') as output:
        for pfam in pfam_cog.keys():
            temp = pfam_cog[pfam]
            pfam_temp = pfam.split(';')
            pfam_code = pfam_temp[0].split('.')[0]
            for cog_cat in list(temp[1]):
                print >> output, pfam_code + '\t' + pfam_temp[1] + '\t' + cog_cat

This script produces a pickle with the mapping in case you want to format the output differently later without having to redo everything, and a tab-delimited file with the COG categories that each pfam belongs to:

PF07792     Afi1     NULL
PF05346     DUF747     NULL
PF12695     Abhydrolase_5     E
PF12695     Abhydrolase_5     I
PF12695     Abhydrolase_5     FGR
PF12695     Abhydrolase_5     Q
PF12695     Abhydrolase_5     R
PF12695     Abhydrolase_5     U
PF12695     Abhydrolase_5     NULL
PF12644     DUF3782     S

Note that each pfam belongs to many COGs (maybe too many to make this useful in many cases).  Hopefully it will help clarify the role of some of these pfams however!
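
Because the mapping is pickled, you can reformat the output later without redoing the blast parse.  Here’s a minimal example of reloading it; the keys follow the Pfam-A header format parsed by the script (e.g. ‘PF12695.5;Abhydrolase_5;’) and the values are (cog, category, name) sets:

import cPickle

pfam_cog = cPickle.load(open('pfam_cog_dict.p', 'rb'))

## print the family name and COG categories for the first few pfams
for pfam in sorted(pfam_cog.keys())[:5]:
    cogs, cats, names = pfam_cog[pfam]
    print pfam.split(';')[1], sorted(cats)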

Posted in Research | Leave a comment

Where goes the carbon in the Palmer LTER?

In the last post I broadly described some of the ecological changes underway in the Palmer LTER.  If we consider only the biological components of an ecosystem (excluding the chemical and physical components) we can diagram their interactions as a food web, with primary producers at the base of the web and top predators at the top.  Changes to the structure of the food web are generally driven by changes at the top (‘top-down’ effects) or at the bottom (‘bottom-up’ effects).  Whaling during the 19th and 20th centuries, for example, exerted a very strong top-down effect on the structure of Antarctic food webs.  Rapidly changing sea ice conditions during this century are exerting a strong bottom-up effect by altering the pattern of primary production.

The classic Antarctic food web. The traditional food web is a good learning tool, but is incomplete. In reality a substantial amount of biomass cycles through the marine microbial community. Taken from http://science.jrank.org/kids/pages/63/FITTING-ALL-TOGETHER.html.

The traditional food web that we all learned in elementary school is (not surprisingly) an incomplete representation of the trophic structure of an ecosystem.  If we consider the carbon flowing through such a food web we see it move in only one direction: up from the primary producers to the top consumers.  A rule of thumb is that the biomass of each trophic level is one tenth of that of the level below it (100 units of carbon in phytoplankton support roughly 10 units of grazers and just 1 unit of their predators), so as carbon moves up through the food web most of it must be exiting the system.  Where does it go?  Much of it is lost through respiration – we and nearly all other multicellular organisms exhale carbon dioxide.  The rest is lost through inefficient feeding, excretion, and death in the consumer trophic levels.

A pelagic food web with the microbial loop. The diagram represents an open-ocean environment different from the west Antarctic Peninsula, but does a good job at highlighting the function of the microbial loop (at left). DOC leaks from the food web at every trophic level. Bacteria incorporate this DOC into microbial biomass and are consumed by small zooplankton, returning the carbon to the food web. Taken from http://oceanworld.tamu.edu/resources/oceanography-book/microbialweb.htm.

If that were the end of the story carbon would be raining down to the deep ocean at a rapid pace in the form of dead phytoplankton, dead consumers, and fecal pellets (this is actually the goal of ocean fertilization), and the photic zone, the sunlit portion of the ocean that supports photosynthesis, would be awash in small organic molecules (what we call dissolved organic carbon, or DOC) that aren’t practical for larger organisms to eat.

What limits this in the ocean are bacteria, which consume DOC and partially degrade detrital particles.  Bacteria are large enough to be prey items for some consumers, so the carbon they incorporate is recycled into the food web in a process known as the microbial loop.  Bacteria take up a huge amount of the carbon that leaves the food web, but not all of it is looped back up to the higher trophic levels.  Like multicellular organisms, heterotrophic bacteria respire carbon dioxide.  If that carbon dioxide isn’t taken back up by phytoplankton (a further loop embedded in the food web) it will eventually leave the system.

Depth integrated bacterial production for the Palmer LTER during the 2014 LTER cruise, estimated from the uptake of tritiated leucine.

How much carbon is taken up by primary production, incorporated into bacterial biomass, and exported from the photic zone are huge research questions.  None of these questions is particularly easy to answer, and the numbers vary widely across time and space.  On the Palmer LTER cruise we used three different methods to estimate these values for the west Antarctic Peninsula.  I’ll talk a little about the first two here.  Bacterial production is measured by the uptake of tritiated leucine.  Leucine is an amino acid that is broadly taken up by the marine microbial community.  Tritium is a radioactive isotope of hydrogen; tritiated leucine is leucine with tritium in place of some of its hydrogen atoms.  It is possible to quantitatively measure very small amounts of radioactivity, which makes it possible to measure the incorporation of very small amounts of tritiated leucine into microbial biomass.

To do this we take water samples from multiple depths at each station and incubate them with tritiated leucine at the in-situ temperature.  After a few hours the bacteria in the water samples are killed (“fixed”), carefully washed, and measured for radioactivity.  From this a rate of leucine incorporation is calculated.  This number tells us something about the rate at which bacteria are increasing their biomass.  What we actually want to know however, is how much carbon they are taking up to do this.  Many years of experiments have allowed the development of a conversion factor to calculate the rate of carbon incorporation from the rate of leucine incorporation.  What this method doesn’t tell us however, is the amount of carbon that is consumed during microbial respiration – a much harder number to get at!
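
The conversion itself is simple arithmetic.  Here’s a sketch using the commonly cited theoretical factor of 1.5 kg carbon per mole of leucine (Simon and Azam 1989); the incorporation rate below is invented for illustration, and in practice empirically determined conversion factors are often preferred:

# Convert a (hypothetical) leucine incorporation rate to bacterial production.
leu_rate = 20.0    # pmol leucine L-1 hr-1, a made-up rate
cf = 1.5           # kg C per mol leucine, theoretical conversion factor

bp = leu_rate * 1e-12 * cf * 1e9    # pmol -> mol leucine, then kg -> ug C
print 'bacterial production: %.3f ug C L-1 hr-1' % bp
## bacterial production: 0.030 ug C L-1 hr-1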

Depth integrated primary production for the Palmer LTER during the 2014 LTER cruise, estimated from the uptake of 14C bicarbonate.

The method for measuring primary production is similar to that for measuring bacterial production, except that radioactive bicarbonate (labeled with the 14C isotope of carbon) is used in place of radioactive leucine.  In seawater bicarbonate exists in equilibrium with dissolved carbon dioxide, which is incorporated into phytoplankton biomass during photosynthesis.

All of this fixed carbon exits the system in one of two ways.  Either it is respired as carbon dioxide as it works its way up the food web, eventually ending up in the atmosphere, or it is exported out of the photic zone as sinking detrital particles.  Particulate matter that is not consumed by bacteria below the photic zone will eventually end up on the seafloor.  The consumption of this carbon by the microbial community is often very high right at the seafloor, but rapidly diminishes below the sediment surface as oxidized chemical species are depleted.  Carbon that makes it to this depth is effectively sequestered until a geological process (such as subduction) returns it to the surface.  Measuring the flux of particulate matter to the seafloor relies on the use of sediment traps or the measurement of radioactive thorium-234, a process that is far more complex than measuring the uptake of tritiated leucine or 14C bicarbonate.  To learn more about it check out the website of Café Thorium, Ken Buesseler’s research group at WHOI, particularly this article.

The two figures above show our estimates of primary and bacterial production across the study area.  Both sets of values are pretty high.  There was a strong phytoplankton bloom this year (perhaps connected to improved sea ice coverage) and the bacterial community seems to have responded in kind to this input of fixed carbon.  Note however, that bacterial production is an order of magnitude below primary production.  Most of the carbon is exiting the system through respiration or particle export.  A small amount is contained within the biomass of phytoplankton, krill, and the higher trophic levels.  If you look carefully at the two plots you’ll also see that bacterial production is somewhat decoupled from primary production, being high in places where primary production is low and sometimes low where it is high.  The devil is in the details, and it will take some digging to understand the dynamics of carbon transfer in this system!

The bacterial abundance, production, and export team for the 2014 Palmer LTER cruise, on the deck of the ARSV Laurence M. Gould.  The odd looking contraption behind us is a French-made sediment trap.  Sediment traps are invaluable tools, but famously undersample particles sinking to the seafloor.  The French design is supposed to minimize these effects.  Fun fact – tenured WHOI faculty are impervious to the cold and to head injuries.


Posted in Research | Leave a comment

A cruise in the Palmer LTER

The Laurence M. Gould in one of the many coves off the West Antarctic Peninsula.

For the last five weeks I’ve been on a research cruise aboard the ARSV Laurence M. Gould off the west Antarctic Peninsula (WAP).  This region has received a lot of attention in recent years as one of the fastest warming places on the planet – winter temperatures have increased 7°C over the last 50 years.  Because of past sealing and whaling and ongoing fishing the WAP is hardly a pristine ecosystem (remember, say no to Chilean sea bass, aka the Patagonian toothfish, one of the most pressured predators in the region).  Nonetheless it’s under much less direct human pressure than most places on Earth, which, combined with the rapid rate of warming, makes it a good site to observe the impact of changing climate on marine ecosystems.  To carry out these observations the WAP hosts a long term ecological research site (the Palmer LTER), one of 25 NSF funded LTER sites.  The Palmer LTER is unusual in being one of only two marine-pelagic LTERs, and also one of only two LTERs located in Antarctica (the other is in the McMurdo Dry Valleys).  The cruise I participated in takes place each year during the Antarctic summer.  It is tasked with making a variety of measurements throughout the Palmer LTER and otherwise supporting science activities connected with long-term observations there.

Sample sites within the Palmer LTER. Each year the Gould occupies approximately three points spanning each “line”, depending on sea ice conditions.

The Laurence M. Gould has extremely limited access to the internet, so unfortunately it wasn’t possible to upload articles in real time.  Instead I’ll try to summarize the cruise in a couple of articles after the fact.

I mentioned that the climate of the WAP is changing fast.  Twenty years ago the whole peninsula would have been considered a polar environment, hosting a polar marine ecosystem.  The outlying islands that extend in a sweeping arc out to South Georgia Island would have been considered subpolar.  The subpolar ecosystem is now moving south in a hurry, changing the distribution of nutrients, primary production, and marine species.  Because it’s hard to picture what’s going on below the ocean’s surface, these changes in the marine ecosystem can be difficult to visualize.  A comparable change in the terrestrial environment might be the encroachment of desert on a region of rich grassland.  Deserts aren’t all bad (I rather like them), but they can’t support the biomass or diversity of grasslands.  In the parlance of economics, they can’t support the same level of ecosystem services.

In the WAP a huge driver of ecological change is sea ice.  The species that occupy the Antarctic Peninsula are divided into two groups: ice-dependent and ice-independent.  The ice-dependent species have an absolute requirement for sea ice and include Adélie penguins, leopard seals, crabeater seals, sea ice algae, and most importantly, krill.  The krill–sea ice relationship isn’t totally straightforward.  Adult krill are happy to feed on phytoplankton in the water column and have no real sea ice requirement.  Juvenile krill, on the other hand, are dependent on the dense layer of lipid-rich algae that grow on the underside of sea ice to support their growth.  No juvenile krill means no adult krill, a problem because everything else in the Antarctic marine ecosystem depends on adult krill.  Without krill there is no way to transfer biomass from primary producers at the base of the food web to higher trophic levels.  To return to the grassland analogy, removing the krill from the ecosystem would be like removing all the hoofed mammals from the Serengeti.  All the predators and scavengers would disappear too.  Ultimately even the grass would fail, because the presence of large grazers has benefits like soil aeration and nutrient and seed dispersal.  In the WAP it would be worse than this, because the biomass far exceeds that of a grassland of comparable size, and this biomass fits critically into the ecology of such globe-trotting species as the humpback whale, orca, and Wilson’s storm petrel (thought by some to be the world’s most abundant bird).

Sea ice cover in the Palmer LTER from 1981 to 2011.  Anvers Island is at the northern extent of the LTER, Adelaide Island near the center, and Charcot Island at the southern extent.  There is a lot of year to year variation, but the overall trend of reduced ice cover is clear.  From Ducklow et al. 2013, see link at end of post.

The general trend in the WAP is for reduced sea ice and primary production in the north of the study region, and increased primary production in the south.  Not all the factors influencing this change are yet clear, nor is it clear how these changes will manifest in the future.  It will take decades of observation to clarify the trend and the mechanisms behind it.  One likely driver of the southward shift in primary production is the reduced availability of micronutrients supplied by glacial meltwater.  The Peninsula, like the rest of the continent, is covered with glaciers.  Glaciers melt at their base where they contact rock, and this mineral-rich meltwater is a key source of iron and other micronutrients to the marine ecosystem.  It’s counterintuitive that warming would reduce the availability of meltwater.  The availability of these micronutrients has to do with the distribution of meltwater in the ocean, however, not the rate of melt.  Freshwater is much less dense than seawater, so glacial meltwater “floats” atop seawater, forming a micronutrient-rich lens.  In the presence of sea ice this lens is protected from wind-driven vertical mixing, and for a brief period each summer following sea ice retreat there is a strong spatial overlap between micronutrient-rich water and sunlight.  The result is a “bloom”, a sudden explosion of growth among the primary producers.  In the absence of sea ice, storms dilute this surface water with micronutrient-poor water from deeper in the ocean.  By summer the amount of photosynthesis that can be supported is much reduced.

Leopard seals in the WAP, near the British Antarctic Survey’s Rothera Station.  Leopard seals are a key ice-dependent predator in the region.

Nothing is that simple however.  In the WAP the upwelling of deep water through marine canyons appears to strongly support primary production.  Like upwelling water the world over, this deep water, while deficient in micronutrients, is rich in macronutrients like nitrate.  So the reality is probably that both water sources, and a reasonably stable water column, are required to support the level of primary production the rest of the ecosystem depends on.  Returning to the grassland analogy, tweaking the delivery and mixing rate of surface and deep water is equivalent to the jetstream shifting its course over land, depriving a historically wet region of water and limiting the amount of plant growth that it can support.

So, from the standpoint of primary production, two important things are happening in the WAP.  First, the reduction in sea ice means there are fewer ice algae to support the growth of juvenile krill.  Second, the reduction in sea ice is increasing the rate of vertical mixing in the water column, reducing the growth of phytoplankton – the food source of adult krill.

This year was an exception to the current trend, with ice conditions rebounding a little after years of decline.  It was nothing approaching the coverage that was normal 15 or 20 years ago, but it was sufficient to support an impressive phytoplankton bloom and will hopefully slow the loss of some of the ice dependent species like the Adélie penguin, whose numbers have been dwindling.  Next post I’ll talk a bit more about some of our surprising findings regarding the fate of all that carbon produced during the bloom…

For more on the Palmer LTER check out this excellent, non-technical article in the magazine of the Oceanography Society: West Antarctic Peninsula: an ice-dependent coastal marine ecosystem in transition

Posted in LMG14-01 | Leave a comment

Android, Mendeley, Referey, and taming the reference beast

Like everyone else in academia I’m in a constant struggle against information overload.  This comes not only from the overwhelming amount of data from analyses I’m working on, but from an ever-growing stream of scientific literature.  I gave up trying to stay on top of journal tables-of-contents a long time ago; instead I rely on Google Scholar alerts to notify me when relevant publications come out.  Unfortunately this switch didn’t reduce the volume of literature that catches my eye, it merely ensured that all the literature would be relevant, interesting, and even more important to file away somewhere and remember.

Some people seem to have a great mental filing system for authors, dates, and subjects of key articles, and can pull a critical reference from the depths of their hard drive based on title alone.  I’m not one of those people.  If I can’t find a reference by a keyword search it doesn’t exist in my universe.  Fortunately, in front of a laptop it is generally possible to find what you want by keyword search.  Both Windows Search and OSX Spotlight automatically index the contents of pdfs, making it easy to find all the papers in a personal library pertaining to say, Wigglesworthia.  To make this even easier, about a year ago I started using Mendeley Desktop, a pretty decent program for organizing and annotating pdfs.  I don’t use Mendeley quite the way it’s meant to be used, however.  The free version of the program comes with a generous chunk of online storage space, which allows you to maintain an online copy of your library that’s accessible from anywhere.  My pdf library is pretty large however, and I’m hesitant to upgrade to the paid version of Mendeley – which comes with more storage space – until I’m really sold on the program (which is happening quickly, aided by the hope that Mendeley’s cite-while-you-write add-on will forever remove from my life the horrible, evil travesty that is Endnote…).

Enter the Android tablet that I recently acquired.  My vision is to be able to sit in a class/seminar/meeting, with that critical paper/fact/reference on the tip of my tongue, and actually be able to LOOK IT UP.  Should be simple enough, right?  Well, maybe.  An additional requirement is that I need to be able to look things up when I’m at sea, in the field, or otherwise not connected to the internet.  This makes it a little trickier.  After reading a number of blog posts from people with similar problems I was able to patch together a solution that works (so far) remarkably well.  The limitations of Android and Apple are equal in this respect, so this solution should be relevant for iPad users too.

My initial thought was to migrate my pdf library to Dropbox, and use the Android Dropbox app and native search function to find what I wanted.  Two problems: 1) the Dropbox app doesn’t actually sync files to the tablet (crazy!) and 2) there IS NO native search function on either Android or Apple iOS that searches file contents (crazier still!).  Fortunately I came across Dropsync, a competent little app that picks up where Dropbox falls flat.  With the paid version you can keep select Dropbox folders synced to the SD card in your Android device.  Problem one solved.

If I didn’t need the search function to work without internet I could have downloaded one of a couple of Mendeley-compatible apps, upgraded to the paid version of Mendeley, and lived the life of online library access (iPad owners can cut out the middleman and make use of an official Mendeley app).  Needing an offline search feature I was pretty well stuck until I found out about Referey (thanks Mendeley, for posting links to third-party apps that help your users…).  Referey was designed with my exact problem in mind.  Here’s how it works:

Having migrated your entire reference library to a folder in Dropbox, you create a symlink in that folder to the sqlite database where Mendeley keeps the index for your library.  On Windows 7 I found the database in C:\Users\Jeff\AppData\Local\Mendeley Ltd\Mendeley Desktop\largedatabasewithyourmendeleyusername.sqlite.  From the destination folder in Dropbox I created a symlink as such:

mklink largedatabasewithyourmendeleyusername.sqlite "C:\Users\Jeff\AppData\Local\Mendeley Ltd\Mendeley Desktop\largedatabasewithyourmendeleyusername.sqlite"

On Apple/Unix systems the equivalent command is ln -s.  That keeps both my reference library and the Mendeley index up to date on my tablet.  On your tablet, in preferences, you tell Referey where in your Dropbox (synced with Dropsync, not Dropbox) you’ve stashed the database and your library.  Referey does no further indexing; it simply uses what Mendeley already knows to find what you want in the library.  Pdfs returned in a search can be opened and annotated using whatever app you want (I’m currently using iAnnotate PDF).

For a while I really doubted whether a solution existed for what seems like an obvious problem.  I was pretty happy to find one…

Posted in Research | 3 Comments

Frost flower metagenome paper submitted

I just hit the “submit” button for our latest frost flower paper, which, reviewers willing, will appear in an upcoming polar and alpine special issue of FEMS Microbiology Ecology.  Several scattered bits of analysis from the paper have appeared in this blog; here’s a recap of what we did and why.

In a 2013 paper we described an unusual community of bacteria in Arctic frost flowers composed primarily of members of the bacterial order Rhizobiales.  This was a bit odd because Rhizobiales, while not unknown in the marine environment, are not typically found in high abundance there.  Rhizobiales had definitely not been observed in association with sea ice.  To try to figure out what was going on with this particular community we went back to the same samples (originally collected in April, 2010) and used whole-genome amplification to amplify all the DNA until we had enough to sequence metagenomes from one frost flower and one young sea ice sample.  The good people at Argonne National Lab did the actual sequencing (great work, great prices…).

Following our failed effort in February of 2010 we received an email in April that natural frost flowers were forming.  I went up the next day and sampled from this lead; these are the samples from our 2013 paper (Bowman et al., EMIR, 2013).  That night it snowed a foot, eradicating all frost flowers.  If I'd been a day later I would have missed them.

The field of frost flowers yielding the samples for this paper, and for Bowman et al. 2013.

I’d never worked with metagenomic data before, and once I had the actual sequence reads it took me a while to figure out how to get the most information out of them.  For a first cut I threw them into MG-RAST.  This confirmed the Rhizobiales dominance easily enough, but for publication-quality results you really need to do the analysis yourself.  Ultimately I used a four-pronged approach to evaluate the metagenomes.

MG-RAST: great for a first cut of your data, and for wowing your advisor with snazzy graphics, but no good for rigorous analysis.  This figure from MG-RAST shows the recruitment of metagenomic sequence reads to the genome of Sinorhizobium meliloti 1021.

1.  Extract 16S rRNA reads and analyze taxonomy.  For this I use mothur, treating the 16S-associated reads like any 16S amplicon dataset (except that you can’t cluster them!).

2.  Align all the reads against all available completed genomes.  We’ve got 6,000 or so genomes in Genbank; may as well do something with them.  This actually provides a ton of information, and backs up the 16S-based taxonomy.

3.  Assemble.  I’m not that good at this; the final contigs weren’t that long.  Cool though.

4.  Targeted annotation.  One would love to blastx all the reads against the nr database, or even refseq protein, but this is silly.  It would tie up thousands of cpus for days; this computing power is better spent elsewhere.  If you have some idea what you’re looking for however, you can blastx against small protein databases pretty easily.  After the blast search you can use pplacer or a similar tool to conduct a more fine-scale analysis of the reads with positive matches.

So what’d we find?  The frost flower metagenome contains lots of things that look like Rhizobiales, but that are pretty different from most of the Rhizobiales with sequenced genomes in Genbank.  Among other things they don’t have nodulation genes or genes for nitrogen fixation, unlike many terrestrial Rhizobiales.  The community does have genes for some interesting processes however, including the utilization of dimethylsulfide (DMS), a climatically active biogenic gas; the degradation of glycine betaine, which may be an important compound in the nitrogen cycle; and haloperoxidases, oxidative stress enzymes that release climatically active methyl halides.

You can find more details on some of this analysis on this poster.

Posted in Research | 1 Comment

Maintaining an updated record of Genbank genomes

For an ongoing project I need a local copy of all the prokaryotic genomes in Genbank.  New genomes are being added at an ever-increasing rate, making it difficult to keep up by manually downloading them from the ftp site.  I finally decided to write a script that will automatically keep my local directory up to date.  I’m only interested in the ffn files, but it would be easy to adapt the following script for any other file type.  As always, suggestions for improving the script are welcome.  A word of warning… I’ve tested all the components of the script individually, but due to long download times I haven’t tested the whole thing as a single entity.

First, the necessary modules:

import subprocess
import os
from ftplib import FTP
import sys
from cStringIO import StringIO

I want the ffn files from both the complete and draft Genbank FTP directories.  Working first with complete:

## quoting the accept pattern keeps the shell from expanding it
bacteria = subprocess.Popen('wget -r -A "*.ffn" -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/', shell = True)
bacteria.communicate()

That should have created a directory tree mimicking that found on the FTP server, but containing only files ending in ffn.  Each genome directory likely contains multiple files, one for each element of the genome.  I’d like to combine them all together so that I can analyze the genome as a unit.

have = os.listdir('combined_ffn')
for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria'):
    if d+'.combined.ffn' not in have:
        subprocess.call('cat ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/'+d+'/*.ffn > combined_ffn/'+d+'.combined.ffn', shell = True)

Now moving on to the draft genomes.  This is a little trickier, since genome directories will be removed from Bacteria_DRAFT once assembly and annotation are complete.  My local directory needs to reflect these changes.  To figure out what should be in the draft directory, and to remove any local directories that should no longer be there:

old_stdout = sys.stdout
sys.stdout = mystdout = StringIO()

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
ftp.cwd('genbank/genomes/Bacteria_DRAFT/')
ftp.dir()
ftp.quit()

sys.stdout = old_stdout
with open('temp.txt', 'w') as good_out:
    print >> good_out, mystdout.getvalue()

for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    if d not in mystdout.getvalue():
        subprocess.call('rm -r ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d, shell = True)

Now update the local directory tree.

d_bacteria = subprocess.Popen('wget -r -A "*scaffold.ffn.tgz" -nc ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/', shell = True)
d_bacteria.communicate()

Another hang-up with the draft genomes is that the ffn files are all present as gzipped tarballs.  They need to be untarred before concatenation.

untar = subprocess.Popen('for d in ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/*;do cd $d;ls *.tgz | parallel "tar -xzvf {.}.tgz";rm *.tgz;cd /volumes/deming/databases/genomes_ffn;done', shell = True)
untar.communicate()

And now we’re on the home stretch…

for d in os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT'):
    cont = False
    contents = os.listdir('ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d)
    for f in contents:
        if 'ffn' in f:
            cont = True
    if d+'.draft.combined.ffn' in have:
        cont = False
    if cont == True:
        subprocess.call('cat ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/'+d+'/*.ffn > combined_ffn/'+d+'.draft.combined.ffn', shell = True)

I tried to write this script in both pure bash and pure Python, and failed in both cases.  A more experienced scripter would probably have little trouble with either, but my version does illustrate how well bash and Python play together.  For the first run I executed the script manually.  It’ll probably take a couple days to download on our building’s 1 Mbps connection (come on UW, it’s 2013…).  Once things are running smoothly I’ll place a call to the script in /private/etc/periodic/weekly and have an automatically updated database!


Posted in Research | Leave a comment

Online bioinformatics courses

Speaking of the UW Oceanography Bioinformatics Seminar… one of the seminar participants just sent around a PLoS Computational Biology article that lists a complete curriculum’s worth of free online bioinformatics courses.  Some of the courses are more suitable for computer scientists looking to learn more biology, others are for biologists looking to learn more computer science.  I haven’t checked out any of the courses yet, but they look intriguing.

Like all PLoS journals, PLoS Computational Biology is open access, so the article can be freely accessed here.  Enjoy!

Posted in Uncategorized | Leave a comment

Phylogenetic Placement Revisited

UW Oceanography has a student-led bioinformatics seminar that meets every spring and fall (or at least it has since the fall of 2012 – we hope to continue it).  Each quarter is a little different; this time around each seminar attendee will present on a tool or set of tools that they’ve applied in their research.  It’s a good way for us to keep up on the ever-changing slate of available tools for sequence analysis, and to delve a little deeper into the ones we currently use.  This week it’s my turn, and I’ve prepared a short tutorial on pplacer, a common phylogenetic placement program.  I wrote a post about phylogenetic placement a few months ago.

There is some unpublished data posted on the seminar website, so it unfortunately isn’t public.  To make the tutorial available I’ve copied it here.  What was great about putting this tutorial together was the opportunity to try papara, an alignment program developed by the same group responsible for RAxML.  Check out Scenario 2 in the tutorial for details on papara.

**************************************************************************

This is a tutorial for using pplacer with 16S rRNA genes under three separate scenarios:

1.  Reference and reference-query alignment using mothur, tree building using RAxML
2.  Reference alignment using mothur, reference-query alignment using papara, tree building using RAxML
3.  Reference alignment using mothur, reference-query alignment using papara, tree building using FastTreeMP

There is really no difference between these 16S rRNA examples and execution with functional gene or aa data, except that of course you would need an alternate alignment strategy (clustalo, hmmalign, muscle) to generate the reference alignment.

Raw data files and scripts appear at the bottom of this page.  You will need to download the following:

mothur
RAxML (compile the PTHREADS versions)
FastTree (use the MP version)
papara – make sure you have recent versions of boost and the gcc compiler; it does not compile with gcc v4.2
pplacer (and taxtastic, guppy)
Archaeopteryx

The following commands are specific to the system they were run on, so be sure to change directory paths and the number of cpus where necessary.  If you are running this on a MacBook set cpus/threads to 2 for mothur and RAxML.  For pplacer to work well the query dataset should fit within the taxonomic range represented by the reference tree.  In this example a set of 16S rRNA reads were classified at the level of order, and we will try to place these reads on a reference tree constructed from full length 16S sequences spanning that order.

Scenario 1 – Reference and reference-query alignment using mothur, tree building using RAxML

First, clean all punctuation from sequence names; note that this will also degap the sequences.  This step is necessary because pplacer tends to break on punctuation.

tr " " "_" < rdp_rhizobiales_type.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > rdp_rhizobiales_type.clean.fasta

Now use mothur to align the reference sequences against an existing high-quality (we hope) alignment.  I’ve used a concatenation of the Eukaryote, Bacteria, and Archaea Silva alignments available on the mothur website.  It’s too big to share on this site, and is overkill for this analysis, but easy to put together if you want to.  Otherwise just use the Bacteria reference file found here.

mothur "#align.seqs(candidate=rdp_rhizobiales_type.clean.fasta, \
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(trump=., vertical=T, processors=8);\
unique.seqs()"

## rename to something convenient

mv rdp_rhizobiales_type.clean.filter.unique.fasta \
rdp_rhizobiales_type.final.fasta

Now build the reference tree with RAxML.  You’ll note that the tree is not bootstrapped; pplacer can’t handle the statistics file that comes with the bootstrapped tree.  It is possible to work around this by generating multiple trees (with multiple statistics files), but here we just aren’t going to worry about it.  In practice reviewers don’t seem to worry that much about it either.  Note that we need the alignment in phy format, and that RAxML won’t write over files of the same name.

perl ./fasta2phyml.pl rdp_rhizobiales_type.final.fasta
rm RAxML*
raxmlHPC-Pthreads -m GTRGAMMA \
-n rdp_rhizobiales_type.final \
-p 1234 -T 8 \
-s rdp_rhizobiales_type.final.phymlAln

pplacer wants the original fasta reference alignment, the reference tree, and the tree-building statistics file all bundled together in one re-usable reference package.  The use of reference packages also aids in the implementation of some of pplacer’s fancier features, like taxonomic classification.  Reference package creation is accomplished with the taxit command from taxtastic, which comes bundled with pplacer.  Taxit won’t overwrite an existing reference package, so it’s useful to remove any old one beforehand.

rm -r rdp_type_rhizobiales.refpkg
taxit create -l 16S -P rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats RAxML_info.rdp_rhizobiales_type.final \
--tree-file RAxML_bestTree.rdp_rhizobiales_type.final

Now we prep the query reads in the same manner as the reference sequences.

tr " " "_" < galand_2009_rhizobiales.fasta | \
 tr -d "%" | \
 tr -d "\'" | \
 tr -d "," | \
 tr -d ";" | \
 tr -d "(" | \
 tr -d ")" | \
 tr ":" "_" | \
 tr "=" "_" | \
 tr "." "_" | \
 tr -d "\"" | \
 tr -d "\*" | \
 tr -d "[" | \
 tr -d "]" | \
 tr "-" "_" > galand_2009_rhizobiales.clean.fasta

Combine the query and reference sequences.

cat galand_2009_rhizobiales.clean.fasta \
rdp_rhizobiales_type.clean.fasta > \
galand_rdp_rhizobiales.combined.clean.fasta

And align.

mothur "#align.seqs(candidate=galand_rdp_rhizobiales.combined.clean.fasta,\
template=/volumes/deming/databases/silva.abe.fasta, processors=8, flip=T);\
filter.seqs(vertical=T, processors=8);\
screen.seqs(minlength=50)"

At this point it is generally preferable to look at the alignment and limit the analysis to the region between the earliest start and latest end positions of your reference alignment.  You can get a rough fix on this by using the mothur command summary.seqs() on the reference-only alignment.  You can then use screen.seqs() to eliminate any wacky reads that didn’t align right.  Here I don’t worry about it.  In either case be sure to convert all your gap characters to '-', as pplacer doesn’t like '.'.

tr "." "-" < \
galand_rdp_rhizobiales.combined.clean.filter.good.fasta > \
galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta

mv galand_rdp_rhizobiales.combined.clean.filter.good.clean.fasta \
galand_rdp_rhizobiales.combined.final.fasta

With a lot of luck pplacer is happy with everything you’ve done up to this point, and won’t yell at you.  Go ahead and run it.  If you start seeing “working on…” messages roll across your screen you can sigh in relief.

One nasty thing that pplacer does if you aren’t looking out for it is produce multiple placements, as opposed to just the top placement.  If you want to keep things easy in downstream analysis tell it to just keep the best placement.

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p -o galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.combined.final.fasta

If pplacer ran successfully it’s time to make some visualizations.  The guppy program comes with pplacer and converts the .jplace file to phyloxml, newick, and csv formats.  It can also do some statistics and fancier things.  Here we use it to make a tree with edges fattened in proportion to the number of query sequences placed there.  After that we make a simple table of nodes and the number of placements at each.  This is great if you just want to make a histogram of your placements, which can be a lot easier than dealing with trees.

guppy fat --pp galand_rdp_rhizobiales.jplace

guppy to_csv --pp galand_rdp_rhizobiales.jplace > \
galand_rdp_rhizobiales.csv

Scenario 2 – using papara in place of the second round of mothur

Scenario 2 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with papara, followed by pplacer…

./papara -t RAxML_bestTree.rdp_rhizobiales_type.final \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp

mv papara_alignment.galand_rdp \
papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c rdp_type_rhizobiales.refpkg \
-p \
-o papara_galand_rdp_rhizobiales.jplace \
papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp papara_galand_rdp_rhizobiales.jplace > \
papara_galand_rdp_rhizobiales.csv

Now you can make use of the guppy kr_heat command to highlight differences between the two trees.

guppy kr_heat --pp \
papara_galand_rdp_rhizobiales.jplace \
galand_rdp_rhizobiales.jplace

Visualize all the phyloxml output with Archaeopteryx.

Scenario 3 – use FastTreeMP in place of RAxML

Both pplacer and papara are pretty happy using trees produced by FastTree, which can be a pretty elegant alternative to RAxML.  Although it is much more streamlined than RAxML, it is not quite as accurate, and should not be used on alignments with fewer than 50 sequences.

Scenario 3 requires that you have some of the files from Scenario 1, so run that if you haven’t already.  If you have, then launch right in with FastTreeMP followed by papara.  Note: I don’t think FastTreeMP was the default name of that binary; check when you install the parallel version of FastTree…

FastTreeMP -nt -gtr \
-log ft_rdp_rhizobiales_type.final.log \
rdp_rhizobiales_type.final.fasta >\
ft_rdp_rhizobiales_type.final.tre

rm -r ft_rdp_type_rhizobiales.refpkg

taxit create -l 16S -P ft_rdp_type_rhizobiales.refpkg \
--aln-fasta rdp_rhizobiales_type.final.fasta \
--tree-stats ft_rdp_rhizobiales_type.final.log \
--tree-file ft_rdp_rhizobiales_type.final.tre

papara -t ft_rdp_rhizobiales_type.final.tre \
-s rdp_rhizobiales_type.final.phymlAln \
-q galand_2009_rhizobiales.clean.fasta \
-n galand_rdp_2

mv papara_alignment.galand_rdp_2 \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

python phyml2fasta.py \
ft_papara_galand_rdp_rhizobiales.combined.final.phy

pplacer --keep-at-most 1 \
-c ft_rdp_type_rhizobiales.refpkg \
-p \
-o ft_papara_galand_rdp_rhizobiales.jplace \
ft_papara_galand_rdp_rhizobiales.combined.final.fasta

guppy fat --pp ft_papara_galand_rdp_rhizobiales.jplace

guppy to_csv --pp ft_papara_galand_rdp_rhizobiales.jplace > \
ft_papara_galand_rdp_rhizobiales.csv

Below: Heat tree showing the difference between mothur and papara alignment after Scenario 2.  Reference tree is RAxML.

Scripts and data files
Posted in Research | 2 Comments