Things are looking up as the snow comes down…

After a tough couple of weeks things are starting to look up.  I’ve got the flow cytometer up and running, and Colleen’s instrument received a complete makeover (thanks to the über instrument tech at Palmer) and is producing good data.  The big question is whether I can gain enough proficiency over the next two days to keep it going after Colleen leaves on Sunday.

The operational instrument status comes just in time; yesterday we went back to the sea ice station that we established on Tuesday to do some science.  In addition to collecting some pretty novel data it was a good chance to practice the measurements we’ll be making all season for the Palmer LTER.  It felt good to get out, but hopefully it will be a little warmer for most of the season.  That it would be cold in early spring in Antarctica is kind of a no-brainer, but that didn’t keep it from surprising me yesterday.  And the downside to doing fieldwork in the cold is that it takes longer, so you end up getting colder, and things take even longer…

Jamie, Nicole, and I get escorted off the ice by a lone Adélie penguin. The Adélie seemed pretty interested in what we were up to at the ice station and followed us the whole way back. Thanks to Rebecca Shoop, the Palmer Station manager, for the photo.

In addition to making all the core LTER measurements (see the end for descriptions) – chlorophyll a, nutrients (inorganic nitrogen and phosphorus), primary production, bacterial production, dissolved organic carbon, particulate organic carbon, bacterial abundance, photosynthetically active radiation, and UV – we took multiple RNA and DNA samples (my main focus for this trip), large amounts of water for lipidomics (Jamie’s project) and samples to measure hydrogen peroxide.  This last measurement was a consolation prize since we couldn’t measure superoxide – the two species have some similarities – and it gives us some indication of what to expect now that Colleen’s instrument is up and running.

So what did we find?  It’s early in the season, and there isn’t that much happening yet below the ice.  Everything is driven by light, and it’s pretty dark under there.  But things are starting to happen, and all the action is near the ice.  We measured only two depths in the water column (and that still took us over three hours), just below the ice and 2 meters further down.  Even over that short distance there was a big difference in what’s going on.  The concentration of hydrogen peroxide – a byproduct of photosynthesis – was much higher near the ice, and there were about four times as many bacteria just beneath the ice as 2 meters below it.

Hopefully, if the weather’s good we’ll get a chance to go back out on Monday.  If the ice holds together for just a couple more weeks we’ll be able to document the transition from an ice-covered to an ice-free state, and get the data to test some hypotheses about how bacteria and phytoplankton respond to this transition.  In the meantime yesterday’s bitterly cold wind has given way to calm conditions and outside the snow is falling.  The woodstove in the Palmer Station galley is putting out a nice glow and the stress of fieldwork is dissipating for a moment…

As promised here’s a quick description of the core LTER measurements:

Chlorophyll a: The principal (but certainly not only) photosynthetic pigment in phytoplankton.  Oceanographers have been measuring the concentration of chlorophyll a in the water for a long time as a measure of phytoplankton biomass, and as an estimate of how much primary production is happening.

Primary production: The amount of carbon dioxide that is being taken up by phytoplankton and converted into organic carbon.  The whole food web depends on primary production, and much of our work is focused on what aspects of the ecosystem control the amount that happens.

Bacterial production: Sort of the inverse of primary production, this is the amount of organic carbon taken up by bacteria.  We can’t measure this directly so we estimate it from the uptake of certain carbon compounds that we can track.

Dissolved organic carbon: One of the most mysterious types of carbon out there (see this article for some indication why).  This is organic carbon in pieces small enough for bacteria to take them up.

Particulate organic carbon: Phytoplankton die, they become particulate organic carbon.  It’s sad.

Bacterial abundance: The number of bacteria in the water, measured on our now operational flow cytometer.

Nutrients: Nutrients in the ocean are operationally divided into macro and micro categories, depending on their biologically relevant concentrations.  We measure nitrogen and phosphorus, the principal macronutrients.

Photosynthetically active radiation (PAR): In addition to nutrients phytoplankton need light to grow.  PAR is the part of the electromagnetic spectrum that can actually be used in photosynthesis.  Too little PAR (like under thick, snow covered ice) and you get very little photosynthesis.  Too much PAR (like at the surface of the ocean during the Antarctic summer) also produces very little photosynthesis!

Posted in Palmer 2015 field season | Leave a comment

Rough Start but Smooth Ice

We’re off to a rough start this season!  Two of our instruments are down, including our flow cytometer – annoying, but we can deal with it – and Colleen’s instrument for measuring superoxide.  That’s a real problem.  Colleen is only with us for five more days.  When she leaves the instrument stays, but we will no longer have a skilled operator!  Measuring superoxide is not trivial and I was supposed to spend a good chunk of this week learning how to do it.  That’s going to be tricky with no instrument.  Fortunately the instrument tech at Palmer this season is handy with a soldering iron and seems to have some ideas.  We’ll see how that plays out tomorrow.

The one piece of good news this week is that the big storm last Sunday didn’t do much damage to the land-fast sea ice near Palmer Station.  At least for now we can do a little science on the ice.  This afternoon Jamie Collins, Nicole Couto, and I went out with the SAR team to establish a sea ice sample site near the station.  Hopefully we can get a couple weeks of sampling at this site before the sea ice deteriorates.

Jamie measures ice thickness. Right about 70 cm in this case; nice thick ice that will hopefully stick around for a while.

Being able to do some science on the sea ice at Palmer Station is actually a pretty big deal and an unexpected bonus for this season.  In some ways this is a very logical place to study ice.  Palmer Station is the United States’ premier polar marine research station, and you can find dozens of papers describing the ecological importance of sea ice in this region.  It’s been years, however, since anyone was able to routinely access sea ice from the station.  Considering the amount of ecological research that takes place here this actually seems a little silly; the single most important feature is virtually ignored for practical reasons.  Working on ephemeral, dynamic sea ice requires a set of skills, equipment, and intrepidness that simply doesn’t exist in this day and age within the US Antarctic Program.

The bottom piece of an ice core collected today. It's early in the season and there isn't much happening yet. If you squint though, you can see the faintest green in the ice, a hint of the algal bloom to come.

Our very small adventure today (on relatively thick, static ice) is reason to hope that that might eventually change.  There isn’t a lot of institutional knowledge about sea ice at Palmer Station, but Station staff and management are open minded and seem eager to learn.  As a further indication the Cold Regions Research and Engineering Lab recently provided new recommendations for sea ice operations at McMurdo Station, a major step toward a rational, data-based policy for traveling and working on ice (which I’ll link if I can find it; too tired to search now… must fix flow cytometer…).

Hopefully we can get some good science done on the sea ice this season.  In the Arctic, large under-ice phytoplankton blooms are a major source of new carbon to the ecosystem.  In the Antarctic, blooms of algae at the ice-water interface are an essential food source for juvenile krill – adult krill being the major food source for virtually everything else down here.  Getting some indication of when, where, and how often these events occur along the West Antarctic Peninsula will tell us a lot about how these ecosystems function, and what will happen to them as the ice season and range continue to decline.

In case you ever have to track a penguin, this is what penguin tracks look like.

Posted in Palmer 2015 field season | Leave a comment

Punta Arenas to Palmer Station

We arrived at Palmer Station last Thursday morning after a particularly long trip down from Punta Arenas. Depending on the weather the trip across the Drake Passage and down the Peninsula to Anvers Island typically takes about four days. This time however, the Laurence M. Gould had science to do and a NOAA field camp to put in at Cape Shirreff on Livingston Island. This was a particularly welcome event as it gave us an opportunity to get off the boat and get a little exercise unloading 5 months of supplies for the NOAA science team.

Since arriving at Palmer Station the activity has been nonstop. In addition to lab orientations and water safety training there is the seemingly never-ending job of setting up our lab and getting instruments up and running. Yesterday evening following the weekly station meeting we did manage to go for a short ski on the glacier out behind the station. I’m glad we did because today the weather took a real turn for the worse; winds are gusting to 55 knots and strengthening. This is a real concern for us because wind strength and direction are the primary determinants of the presence and condition of sea ice in this area. As I wrote in my previous post we are hoping for sea ice to be either very solid, so we can sample from it, or clear out completely, so we can get the zodiacs in the water. We’ll have to wait until the storm passes to see what conditions are like but very likely it will be neither!

A derelict steel-hulled sailing vessel beached outside of Punta Arenas (taken in 2013 on my last trip to Antarctica). Before the Panama Canal opened Punta Arenas, located on the Strait of Magellan, was an important stopping point for ships sailing between the Atlantic and Pacific. Today the city is best known as the jumping off point for cruise ships (and research vessels) heading to Antarctica, and as an access point to Chile’s Torres del Paine.

A moderate swell breaking on the side of the Gould as we leave Tierra del Fuego behind. Overall it was an extremely mild crossing of the Drake Passage, which didn’t prevent me from getting sick (as per SOP).

As we crossed the Antarctic Polar Front the weather got noticeably colder. Here, sea spray freezes on one of the Gould’s spotlights.

A welcome diversion was the NOAA field camp put-in at Cape Shirreff. This included such antics as raising (sans crane) a four-wheeler from a bobbing zodiac onto a six-foot high snow berm. Don’t ask me how it was done; I was there and I’m still not sure. In this photo you can see the Laurence M. Gould in the background and a zodiac bringing in another load of supplies.

The Gould picks its way towards the pack ice. I’ve only been on two ships in sea ice and the experiences couldn’t have been more different. Back in 2009 I sailed in the Arctic onboard Oden, a powerful Swedish icebreaker. We smashed ice a meter thick and more day and “night” (it was summer) for six weeks straight. The Gould is a different sort of animal. It isn’t a true icebreaker and, if winds and currents conspired against it, could become trapped in rafting ice. Moving the Gould into even thin ice is a delicate process.

Science in action! There were two science parties conducting research on the way down. One is studying the distribution of krill in the Drake Passage. The other is studying the response of deep water corals to ocean warming. Here graduate student Caitlin Cleaver from the University of Maine washes corals freshly collected from 700 meters deep. The corals were transported live to Palmer Station for further experiments.

Yesterday morning the Laurence M. Gould raised its gangplank and departed Palmer Station.

Jamie and Colleen take a break from lab setup for a hike on the glacier behind Palmer Station (seen in the background).

Today started calm with grey skies, but conditions deteriorated with astonishing speed after lunch. Within just a few minutes winds went from a steady 20 knots to gusting to 55 knots (63 mph).

Posted in Palmer 2015 field season | Leave a comment

Enroute to Palmer Station

I’m currently sitting in the Dallas airport waiting for a flight to Santiago, Chile, en route to Palmer Station for the 2015 spring season. Since there is no airfield at Palmer we’ll go in and out by boat (the ARSV Laurence M. Gould). Hopefully we’ll be at the station by October 28 and able to start doing some science not too long after that. There are a couple of reasons why I’m excited about the upcoming season. First, as I discuss in this post, conditions are highly unusual this year, with the extent of sea ice reaching a level not seen at Palmer Station for many years. The reason for this seems to be the persistent warm El Niño conditions in the tropical Pacific Ocean, now complemented by a near zero to negative Southern Annular Mode (negative SAM values are correlated with high sea ice conditions). This increase in sea ice is a counterintuitive but very real effect of global climate change; increased heat in one area of the globe alters global wind patterns and decreases the flow of heat to other areas of the globe. It hasn’t actually been very cold at Palmer Station (the high today was a balmy 24 °F at the time of writing) and how long the sea ice lasts will depend very much on what happens to winds in the region.

Coming in an era defined by decreasing sea ice along the West Antarctic Peninsula the presence of heavy ice cover could have some interesting ecological impacts. There is a strong likelihood that it will be good for the Adélie penguins, but my primary interest is a little lower down in the food web. I’ll be studying interactions between phytoplankton, the basal food source for the WAP ecosystem, and bacteria at the onset of the spring bloom, hoping to identify cooperative interactions through patterns in bacterial gene expression. Toxic compounds produced by phytoplankton, for example, may be cleaned up by bacterial partners, allowing photosynthesis to proceed more efficiently (ultimately meaning more food for the whole food web). Observing the expression of genes coding for the bacterial enzymes that carry out these processes would be strong evidence for this kind of synergy, which leads me to the second reason I’m excited about the upcoming season.

Electron configuration of superoxide. The extra electron is one more than oxygen can handle, and makes the molecule highly reactive. Image from https://commons.wikimedia.org/wiki/File:Superoxide.png.

This year I’m joined by Colleen Hansel and Jamie Collins from the Woods Hole Oceanographic Institution. Colleen and Jamie are chemical oceanographers and experts in identifying specific compounds produced by phytoplankton. Colleen has pioneered a technique to measure superoxide, a damaging free radical, directly in the water column. This is not a trivial undertaking as the half-life of superoxide is only seconds, making traditional oceanographic sampling techniques (such as a Niskin bottle) impossible to employ. Instead we will focus on sampling water in the first few meters of the water column, just above the maximum zone of primary production. Superoxide is produced during photosynthesis, when energetic electrons glob onto free oxygen. The extra electron makes oxygen highly reactive (hence superoxide; it’s a superoxidant) and physiologically damaging. Bacteria have some interesting molecular tools to deal with superoxide however, so perhaps they’ve evolved the ability to perform this service for phytoplankton in exchange for fixed carbon. Coupling observations of gene expression with measures of superoxide and other reactive chemical species is much more powerful, and will tell a much more complete story, than either does alone.

It’s impossible to anticipate how the ice will impact our science plan until we’re at the station and get a feel for how logistics will work this season. Typically sampling at Palmer Station is done by zodiac, which requires reasonably ice-free conditions. The zodiacs can push around a small amount of brash ice but lack the mass (and shrouded propeller) to deal with large quantities. The ice is solid enough this year that we may be allowed to use this ice as a sampling platform – something I’ve got plenty of experience with from previous trips to the Arctic and Antarctic. This is a little out of the norm for Palmer Station however, so we’ll have to see how negotiations proceed.

In our worst-case scenario the ice conditions deteriorate to the point that we can’t sample from it, but not so much that we can push a zodiac through it. The normal sampling procedure in this case is to use a plumbed seawater intake to sample from below the ice (with the added benefit that you can sample from the comfort of the lab); however, this won’t work given the short half-life of superoxide. In this eventuality I think we can salvage the project by focusing on ice algae in place of phytoplankton. Ice algae are essentially phytoplankton which have given up their free-living lifestyle and formed colonies on the underside of the sea ice. These dense mats are a very important food source for juvenile krill, but are understudied in the region given the inconsistent nature of sea ice along the WAP. If we can access some decent ice floes from shore I think we can make a good study of the superoxide gradient, and bacterial response, toward the ice algal colonies. Previous work has shown that ice algae can be under significant oxidative stress so they may have good reason to solicit a little help from bacteria.

Posted in Palmer 2015 field season | Leave a comment

paprica v0.20

A couple of months ago I published paprica v0.11, a set of scripts for conducting a metabolic inference from a collection of 16S rRNA gene reads.  This approach allows you to estimate the functional capabilities of a microbial community if you don’t have access to a metagenome or metatranscriptome.  Paprica started as a method for a paper I was writing but eventually became complex enough to warrant its own publication.  Paprica v0.11 reflected this origin – it produced nice results but was kludgy and cumbersome.

Over the last couple of weeks I’ve given paprica a complete overhaul and am happy to introduce v0.20.  There are a number of major differences between v0.11 and v0.20, but the most significant is a clearer division between construction of the database, for those who want full control (and access to the PGDBs), and sample analysis, which can proceed with only the provided, light-weight database (although you will not have access to the PGDBs).  Executing paprica v0.20 is as easy as (from your home directory, for the provided file test.fasta):

git clone https://github.com/bowmanjeffs/genome_finder.git
cd genome_finder
chmod a+x paprica_run.sh
./paprica_run.sh test

One really important distinction between this version and v0.11 is that metabolic pathways are NOT predicted directly on internal nodes.  This was done for reasons of organization and efficiency, but I’m not sure that it made much sense to do this anyway.  Instead the pathways likely to be found for an internal node are inferred from their appearance in terminal daughter nodes (that is, the completed genomes that belong to the clade defined by the internal node).  If a given pathway is present in some specified fraction (0.90 by default) of the terminal daughters it is included in the internal node.  You can change this value by modifying the appropriate variable in pathway_profile.txt.  Some (including myself) might like to have a PGDB for an internal node for purposes of visualization or modeling.  In the near future I’ll release a utility to create a PGDB for an internal node on demand.
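As a rough illustration of the rule described above (this is not paprica's actual code), the sketch below includes a pathway in an internal node when it appears in at least a cutoff fraction of that node's terminal daughters. The function name, the data layout, and the toy pathway names are all assumptions made up for the example.

```python
from collections import Counter


def infer_internal_pathways(daughter_pathways, cutoff=0.90):
    """Return the pathways present in at least `cutoff` of the terminal daughters.

    daughter_pathways: dict mapping each terminal daughter (a completed genome)
    to the set of pathways predicted for it. Hypothetical layout, for
    illustration only.
    """
    n_daughters = len(daughter_pathways)
    if n_daughters == 0:
        return set()

    # Count how many daughters carry each pathway.
    counts = Counter()
    for pathways in daughter_pathways.values():
        counts.update(set(pathways))

    return {pw for pw, n in counts.items() if n / n_daughters >= cutoff}


# Toy clade of ten genomes: every genome has glycolysis, nine of ten have the
# TCA cycle, and only five have denitrification.
clade = {}
for i in range(10):
    pathways = {'glycolysis'}
    if i < 9:
        pathways.add('TCA cycle')
    if i < 5:
        pathways.add('denitrification')
    clade['genome_{}'.format(i)] = pathways

internal_node = infer_internal_pathways(clade)
print(sorted(internal_node))
```

With the default cutoff only the two widespread pathways make it into the internal node; lowering the cutoff to 0.5 would pull in denitrification as well.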

Some other major improvements…

  • Fewer dependencies.  For the scripts called in paprica_run.sh you need pplacer, seqmagick, infernal, and some Python modules that you should probably have anyway.
  • Improved reference tree.  I’m still working on this, but the current method uses RAxML for phylogenetic inference and Infernal for alignment, which seems to work much better than the previous (albeit much faster) combo of Fasttree and Mothur.  Thanks to Erick Matsen for helpful suggestions in this regard.
  • More genome parameters.  I have a particular interest in how genome parameters (e.g. length, coding density, etc.) are distributed in the environment.  Paprica gives you a whole list of interesting metrics for the terminal and internal nodes.

Paprica is still in heavy development and I have a lot of improvements planned for future versions.  If you try v0.20 I’d love to know what you think – good, bad, or otherwise!  You can create an issue on Github or email me.

Posted in paprica | Leave a comment

SCAR session on microbial ecology

Along with colleagues from New Zealand, Argentina, and Malaysia I’m convening a session on microbial ecology and evolution at the upcoming biennial SCAR meeting in Kuala Lumpur (because there’s no better place to talk about ice than the tropics).  If this sounds like your sort of thing check it out!

S23. Microbes, diversity, and ecological roles

Walter MacCormack, Argentina; Charles Lee, New Zealand; Chun Wie Chong, Malaysia; Jeff Bowman, USA

The ecology of Antarctica is largely shaped by microbes, with microbial life, including prokaryotes and unicellular eukaryotes, serving as the main drivers of ecosystem function.  Given this, it is perhaps surprising that our current understanding of Antarctic biota has been derived primarily from studies of metazoans. Despite major advances in the field of Antarctic microbiology in recent years there remains a knowledge gap in our understanding of the distribution, functions, and adaptations of Antarctic microbes. There is a general consensus that Antarctic microorganisms are highly diverse, and in many cases encompass endemic gene pools with unique physiological and genetic adaptations to the extreme conditions of their environment. Relatively recently, the advent of ‘omics platforms has allowed researchers to observe these processes in great detail. This session welcomes submissions on all aspects of microbial ecology and evolution in Antarctica and the Southern Ocean. This includes ‘omics-based approaches to understanding prokaryotic and unicellular eukaryotic diversity, function, adaptation, as well as laboratory and field-based studies of microbial and ecological physiology. Special consideration will be given for abstracts addressing the following issues: (1) Microbial biogeography, functional redundancy, and ecosystem services; (2) Trophic connectivity between prokaryotes and eukaryotes; (3) Cold adaptation strategy and evolution; and (4) Multiple ‘omics integration addressing systems biology of Antarctic ecosystems.

Posted in Uncategorized | Leave a comment

Sea ice bacteria review published

I’m really excited (and relieved) to report that my review on the taxonomy and function of sea ice microbial communities was recently published in the journal Elementa.  The review is part of a series on biological exchange processes at the sea ice interface, by the SCOR working group of the same name (BEPSII).  I’m deeply appreciative of Nadja Steiner, Lisa Miller*, Jacqueline Stefels, and the other senior members of BEPSII for letting (very) junior scientists take such an active role in the working group.  I conceived the review in a foggy haze last year while writing my dissertation, when I assumed that there would be “plenty of time” for that kind of project before starting my postdoc.  Considering that I didn’t even start aggregating the necessary data until I got to Lamont I’m also deeply appreciative of my postdoctoral advisor for supporting this effort…

The review is really half review, half meta-analysis of existing sea ice data.  The first bit, which draws heavily on the introduction to my dissertation, describes some of the history of sea ice microbial ecology (which goes back to at least 1918 for prokaryotes).  From there the review moves into an analysis of the taxonomic composition of the sea ice microbial community, based on existing 16S rRNA gene sequence data, takes a look at patterns of bacterial and primary production in sea ice, and then uses PAPRICA to infer metabolic function for the observed microbial taxa (after 97 years we still don’t have any metagenomes for sea ice – let alone metatranscriptomes – and precious few isolates).

There is a lot of info in this paper but I hope a few big points make it across.  First, we have a massive geographical bias in our sea ice samples.  This is to be expected, but I don’t think we should just accept it as what has to be.  More disconcerting, there has been very little effort to integrate physiological measures in sea ice (such as bacterial production) with analyses of microbial community structure.  A major exception is the work of the Kaartokallio group at the Finnish Environment Institute, but their work has primarily taken place in the Baltic Sea (an excellent system, but very different from the high Arctic and coastal Antarctic).  This all translates into work that needs to be done, however, which is a good thing… we are just barely at the point where we can make reasonable hypotheses regarding the functions of these communities.

Taken from Bowman, 2015. Sampling locations for sea ice studies that have collected community structure data (blue), ecological physiology data (red), and both (green). Note the strong sampling bias, particularly in the Antarctic. The black arrows point to the locations of the two community structure studies (at the time of writing) that were sufficiently deep to actually describe community structure.

*This image of Lisa pops up a lot. If you can identify what, exactly, is going on in this picture I’ll buy you a beer.

Posted in Research | Leave a comment

Microbial ecology of the cryosphere

A quick post on an excellent review published last week by Antje Boetius and co-authors (including Jody Deming, my PhD advisor) in Nature Reviews Microbiology, titled Microbial ecology of the cryosphere: sea ice and glacial habitats.  The review, focused on viral, bacterial, and archaeal microbes, provides an excellent overview of the major habitats within the cryosphere (broadly glacial ice, sea ice, and snow), the challenges and opportunities for microbial life, and the observed distribution of taxa and genes (to the extent that we know it).  Like most Nature Reviews it is written for a broad audience and assumes no deep knowledge of microbial ecology or the cryosphere.

Taken from Boetius et al., 2015.  Top: a schematic of different elements of the cryosphere, b: warm, summertime sea ice, c: the supraglacial environment, featuring a meltriver, d: cold winter sea ice, e: the subglacial environment, featuring the Blood Falls outflow from Taylor Glacier.

Plenty of reviews have been written on microbial life at low temperature; what makes this one stand out to me is the ecological focus.  Although discussions of biogeography (i.e. what taxa are where) and metabolism are woven throughout the review, the emphasis is on habitats, including newly recognized habitats like frost flowers and saline snow.  Check it out!

Posted in paprica | Leave a comment

And now…

…for something completely different.  My wife and I are expecting our first child in a few months, which is wonderful and all, but means that we are faced with the daunting task of coming up with a name.  Being data analysis types (she much more than me), and subscribing to the philosophy that there is no problem that Python can’t solve, we decided to write competing scripts to select a good subset of names.  This is my first crack at a script, which I’ve titled BAMBI for BAby naMe BIas.  I’ve also posted the code to Github; that will stay up to date as I refine my method (in case you too would like Python to name your child).

My general approach was to take the list of baby names used in 2014 and published by the Social Security Administration here, bias against the very rare and very common names (personal preference), then somehow use a combination of our birth dates and a random number generator to create a list of names for further consideration.   Okay, let’s give it a go…

First, define some variables. Their use will be apparent later.  Obviously replace 999999 with the real values.

get = 100 # how many names do you want returned?
wife_bday = 999999
my_bday = 999999
due_date = 999999
aatc = 999999 # address at time of conception
size = (wife_bday + my_bday) // (due_date // aatc) # integer division, so size is a whole number of draws
start_letters = ['V','M'] # restrict names to those that start with these letters, can leave as empty list if no restriction desired
sex = 'F' # F or M

Then import the necessary modules.

import matplotlib
import numpy as np
import matplotlib.pyplot as py
import math
import scipy.stats as sps

Define a couple of variables to hold the names and abundance data, then read the file from the SSA.

p = [] # this will hold abundance
names = [] # this will hold the names

with open('yob2014.txt', 'r') as names_in:
    for line in names_in:
        # each record is name,sex,count
        name, name_sex, count = line.rstrip().split(',')
        if name_sex != sex:
            continue
        # an empty start_letters list means no restriction
        if len(start_letters) == 0 or name[0] in start_letters:
            p.append(float(count))
            names.append(name)

Excellent. Now the key feature of my method is that it biases against both very rare and very common names. To take a look at the abundance distribution run:

py.hist(p, bins = 100)

Ignore the ugly X-axis.  Baby name abundance follows a logarithmic distribution; a few names are given to a large number of babies, with a long “tail” of rare baby names.  In 2014 Emma led the pack with 20,799 new Emmas welcomed into the world.  My approach – I have no idea if it’s at all valid, so use on your own baby with caution – was to fit a normal distribution to the sorted list of names.  I got the parameters for the distribution from the geometric mean and standard deviation (as the arithmetic mean and SD have no meaning for a log distribution).  The geometric mean can be calculated with the gmean function, but I could not find a ready-made function for the geometric standard deviation:

geo_mean = sps.mstats.gmean(p)
print 'mean name abundance is', geo_mean

def calc_geo_sd(geo_mean, p):
    p2 = []

    for i in p:
        p2.append(math.log(i / geo_mean) ** 2)
    
    sum_p2 = sum(p2)
    geo_sd = math.exp(math.sqrt(sum_p2 / len(p)))
    return(geo_sd)
    
geo_sd = calc_geo_sd(geo_mean, p)
print 'the standard deviation of name abundance is', geo_sd
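
As a cross-check on the function above, both quantities can also be computed on the log scale: the geometric mean is the exponential of the mean of the logs, and the geometric SD is the exponential of the (population) standard deviation of the logs. A minimal sketch, assuming NumPy and using made-up abundance values rather than SSA data:

```python
import numpy as np

# toy abundance values (made up for illustration, not SSA data)
toy_p = np.array([5.0, 20.0, 80.0, 320.0])

log_p = np.log(toy_p)

# geometric mean = exp(mean of the logs)
geo_mean_check = np.exp(log_p.mean())

# geometric SD = exp(SD of the logs); np.std divides by n by default,
# which matches the division by len(p) in calc_geo_sd
geo_sd_check = np.exp(log_p.std())

print(geo_mean_check)
print(geo_sd_check)
```

Running the same toy list through gmean and calc_geo_sd should give matching numbers.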

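## get a gaussian distribution of mean = geo_mean and sd = geo_sd
## one random draw per baby, i.e. sum(p) draws; these are binned into len(p) bins below

dist_param = sps.norm(loc = geo_mean, scale = geo_sd)
dist = dist_param.rvs(size = int(sum(p))) # size must be an integer and p holds floats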

## now get the probability of these values

print 'wait for it, generating name probabilities...'
temp_hist = py.hist(dist, bins = len(p))
probs = temp_hist[0]
probs = probs / sum(probs) # potentially max(probs)
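
The histogram call above is only used for its bin counts, so the same step can be sketched with np.histogram, which returns the counts without drawing a figure. This is an illustration under assumptions, not a drop-in replacement: the standard normal, the sample size, and the 11 bins are stand-in values, and the idea only makes sense if the list of names is ordered by abundance so that the middle bins line up with names of middling abundance.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable

n_names = 11  # hypothetical number of names, one histogram bin per name
samples = rng.normal(loc=0.0, scale=1.0, size=100000)

# np.histogram returns (counts, bin_edges) and draws nothing
counts, edges = np.histogram(samples, bins=n_names)

# normalize the counts so they sum to 1 and can be used as probabilities
toy_probs = counts / float(counts.sum())

print(toy_probs)
```

The probabilities are largest in the middle bins and fall away toward both ends, which is the bias against very common and very rare names described earlier.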

At this point we have a list of probabilities the same length as our list of names and preferencing names of middle abundance. The next and final step is to generate two pools of possible names. The first pool is derived from a biased-random selection that takes into account the probabilities, birth dates, due date, and address at time of conception. The second, truly random pool is a subset of the first with the desired size (here 100 names).

possible_names = np.random.choice(names, size = size, p = probs, replace = True)
final_names = np.random.choice(possible_names, size = get, replace = False)
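
To see what the two draws do, here is a small sketch with five made-up names and hand-picked probabilities (all values are hypothetical). One thing it makes visible: replace = False in the second draw only stops the same pool entry from being picked twice, so the short list can still contain a name more than once if that name appears several times in the pool.

```python
import numpy as np

np.random.seed(1)  # fixed seed so the sketch is repeatable

# hypothetical names and probabilities, weighted toward the middle of the list
toy_names = np.array(['Ada', 'Bea', 'Cleo', 'Dot', 'Eve'])
toy_probs = np.array([0.05, 0.2, 0.5, 0.2, 0.05])

# first pool: biased draw with replacement
pool = np.random.choice(toy_names, size=1000, p=toy_probs, replace=True)

# second pool: unweighted draw from the first pool, without replacement
shortlist = np.random.choice(pool, size=10, replace=False)

# collapse repeated names if a list of distinct names is wanted
distinct = np.unique(shortlist)

print(shortlist)
print(distinct)
```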

And finally, print your list of names! I recommend roulette or darts to narrow this list further.

with open('pick_your_kids_name.txt', 'w') as output:
    for name in final_names:
        print name
        print >> output, name

Introducing PAPRICA

I’m very excited to report that our latest paper, “Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula”, was just published in the journal PLoS ONE.  The paper builds on two very distinct bodies of work: a growing literature on microbial community structure and function along the climatically sensitive West Antarctic Peninsula, and a family of new techniques to predict community metabolic function from 16S rRNA gene libraries, which we are calling metabolic inference.

The motivation for metabolic inference lies in the large amount of time that it takes to manually curate a likely set of functions for even a small collection of 16S rRNA genes.  In today’s world, where most analyses of microbial community structure consist of many thousands of reads representing hundreds of taxa, it is simply impossible to dig through the literature on each strain to see what metabolic role each is likely to be playing.  Ideally a researcher would use metagenomics or metatranscriptomics to get at this information directly, but it is not advisable or desirable in most cases to sequence hundreds of metagenomes or metatranscriptomes (necessary for the kind of temporal or spatial resolution many of us want these days).  Metabolic inference provides a convenient alternative.

A quick Google Scholar survey of the number of studies since 2005 that have used high throughput 16S rRNA gene sequencing.  Over the last ten years we’ve collected an astonishing amount of sequence data from a diverse array of environments; however, much of this data has been from taxonomic marker genes like the 16S rRNA gene, leaving microbial community function largely unknown.  PAPRICA and other methods that try to infer microbial functional potential from 16S rRNA gene data can help bridge this gap.

The basic concept behind all metabolic inference techniques (e.g. PICRUSt, Tax4Fun, PAPRICA) is hidden state prediction, or HSP (you can find a nice paper on HSP here).  In 16S rRNA gene analysis metabolic potential is a hidden state.  The metabolic inference techniques propose different ways to predict this hidden state based on the information available.

Our small contribution to this effort was to develop a method (PAPRICA – PAthway PRediction by phylogenetIC plAcement) that uses phylogenetic placement to conduct the metabolic inference instead of an OTU (operational taxonomic unit) based approach.  Our approach provides a more intuitive connection between the 16S rRNA analysis and the HSP (or at least it does in my mind) and can increase the accuracy of the inference for taxa that have a lot of sequenced genomes.

Most analyses of large 16S rRNA datasets rely on an OTU-based approach.  In a typical OTU analysis an investigator aligns 16S rRNA reads, constructs a distance matrix of the alignments, and clusters the reads at some predetermined distance.  By tradition the default distance has become a dissimilarity of 0.03.  This approach has some advantages.  By clustering reads into discrete units it is easy to quantify the presence or absence of different OTUs, and it allows microbial ecologists to avoid problems with defining prokaryotic species (which defy most of the criteria used to define species in more complex organisms).  To conduct a metabolic inference on an OTU-based analysis it is possible to simply reconstruct the likely metabolism for a predefined set of OTUs based on the OTU assignments of published genomes.  This works great, but it limits the resolution of the inference to the selected OTU definition (i.e. 0.03).  For some taxa, such as Escherichia coli (and plenty of more interesting environmental bugs), there are many sequenced genomes that have very similar 16S rRNA gene sequences.  PAPRICA provides a way to improve the resolution of the metabolic inference for these taxa.
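
The clustering step in a typical OTU analysis can be sketched with SciPy's hierarchical clustering. This is a toy illustration under assumptions: the distance matrix for four reads is made up, and average linkage is just one reasonable choice here, not necessarily the algorithm any particular OTU pipeline uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# hypothetical pairwise dissimilarities between four aligned reads
reads = ['read_a', 'read_b', 'read_c', 'read_d']
dist = np.array([
    [0.00, 0.01, 0.10, 0.10],
    [0.01, 0.00, 0.10, 0.10],
    [0.10, 0.10, 0.00, 0.02],
    [0.10, 0.10, 0.02, 0.00],
])

# linkage expects the condensed (upper triangle) form of the matrix
tree = linkage(squareform(dist), method='average')

# cut the tree at a dissimilarity of 0.03 to get OTU labels
otu_labels = fcluster(tree, t=0.03, criterion='distance')

for read, otu in zip(reads, otu_labels):
    print(read + ' is in OTU ' + str(otu))
```

Reads a and b fall into one OTU and reads c and d into another, and any difference between a and b is lost from that point on. That loss of resolution is the limitation described above.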

Our approach was to build a phylogenetic tree of the 16S rRNA genes from each completed genome.  For each internal node on the reference tree we determine a “consensus genome”, defined as all genes shared by all members of the clade originating from the node, and predict the metabolic pathways present in the consensus and complete genomes using Pathway-Tools.  To conduct the actual analysis we use pplacer to place our query reads on the reference tree and assign the metabolic pathways for each point of placement to the query reads.  One advantage to this approach is that the resolution changes depending on the genome sequence coverage of the reference tree.  For families, genera, and even species for which lots of genomes have been sequenced, resolution is high.  For regions of the tree where there are not many sequenced genomes resolution is poor; however, the method will give you the best of what’s available.
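
The consensus genome idea can be illustrated with plain Python sets. This sketch assumes the consensus genome is the set of genes common to every genome in a clade (my reading of the definition above); the genome names, gene names, and clade memberships are all hypothetical, and it is not the PAPRICA code itself.

```python
# hypothetical gene content for three genomes at the tips of a small reference tree
tip_genes = {
    'genome_A': {'gene1', 'gene2', 'gene3', 'gene4'},
    'genome_B': {'gene1', 'gene2', 'gene3'},
    'genome_C': {'gene1', 'gene5'},
}

# hypothetical internal nodes, each listed with the tips in its clade
clades = {
    'node_AB': ['genome_A', 'genome_B'],
    'node_ABC': ['genome_A', 'genome_B', 'genome_C'],
}

def consensus_genome(tips):
    """Return the genes shared by every genome in the clade."""
    return set.intersection(*(tip_genes[t] for t in tips))

consensus = {node: consensus_genome(tips) for node, tips in clades.items()}

for node in sorted(consensus):
    print(node + ': ' + ', '.join(sorted(consensus[node])))
```

The shallow node keeps three genes while the deeper node keeps only one, so a read placed deep in the tree gets a smaller, more conservative set of predictions.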

Fig_2

Figure from Bowman and Ducklow, 2015.  PAPRICA includes a confidence scoring metric that takes into account the relative plasticity of different genomes.  In this figure each vertical line is a genome (representing a numbered terminal node on our reference tree), with the height and color of the vertical line giving its relative plasticity (which we refer to as the parameter phi).  The genomes identified with Roman numerals are all known to be exceptionally modified, which is a nice validation of the phi parameter.  Many of these are obligate symbionts.  I) Nanoarchaeum equitans, II) the Mycobacteria, III) a butyrate-producing bacterium within the Clostridium, IV) Candidatus Hodgkinia cicadicola, V) the Mycoplasma, VI) Sulcia muelleri, VII) Portiera aleyrodidarum, VIII) Buchnera aphidicola, IX) the Oxalobacteraceae.

PAPRICA provides some additional helpful pieces of information.  We built in a confidence scoring metric that takes into account both predicted genomic plasticity and the size of the consensus genome relative to the mean size for the clade (deeper branching clades will have a bigger difference).  PAPRICA also predicts the size of the genome and number of 16S rRNA gene copies associated with each 16S rRNA gene, both of which have a strong connection to the ecological role of a bacterium.

For our initial application of PAPRICA we selected a previously published 16S rRNA gene sequence dataset from the West Antarctic Peninsula (our primary region of interest).  One thing that we were very interested in looking at was whether we could describe differences between microbial communities organized along ecological gradients (e.g. inshore vs. offshore, or surface vs. deep water) in terms of metabolic structure in place of the more traditional 16S rRNA gene (i.e. taxonomic) structure.  Using PAPRICA to convert the 16S rRNA gene sequences into collections of metabolic pathways we found that we could reconstruct the same inter-sample relationships identified by an analysis of taxonomic structure.  This means that a microbial ecologist can, if they choose, disregard the messy and sometimes uninformative taxonomic structure data and go directly to metabolic structure without losing information.  Applying common multivariate statistical approaches (PCA, MDS, etc.) to metabolic structure data yields information like which pathways are driving the variance between sites, and which are correlated with what environmental parameters.  This information is much more relevant to most research questions than the distribution of different microbial taxa.  It is worth noting that while inter-sample relationships are well preserved in metabolic structure, the absolute distance between samples is much less than for taxonomic structure.  This might have some implications for the functional resilience of microbial communities, which we get into a little bit in the paper.
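
As an illustration of the multivariate step, here is a minimal PCA sketch using NumPy's singular value decomposition on a made-up pathway abundance table. The samples, pathway names, and abundances are hypothetical; a real analysis would start from PAPRICA's output and would probably normalize the data first.

```python
import numpy as np

# hypothetical pathway abundance table: rows are samples, columns are pathways
pathways = ['pathway_1', 'pathway_2', 'pathway_3']
abund = np.array([
    [10.0, 5.0, 1.0],
    [12.0, 5.0, 1.0],
    [30.0, 5.0, 2.0],
    [32.0, 5.0, 2.0],
])

# center each pathway on its mean, then decompose
centered = abund - abund.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# fraction of the total variance captured by each principal component
explained = s ** 2 / np.sum(s ** 2)

# rows of vt are the component loadings; the pathway with the largest
# absolute loading on the first component drives most of the variance
top_pathway = pathways[int(np.argmax(np.abs(vt[0])))]

print(explained)
print(top_pathway)
```

In this toy table pathway_1 separates the first two samples from the last two, so it dominates the first component.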

PAPRICA was an outgrowth of a couple of other papers that I’m working on.  At some point the bioinformatic methods reached a point where separate publication was justified.  As a result, and reflecting the fact that I’m much more an ecologist than a computational biologist, PAPRICA is not nearly as streamlined as PICRUSt (which is even available through an online interface).  I’ve spent quite a bit of time, however, trying to make the scripts user-friendly and transportable.  Anyone should be able to get them to work without too much difficulty.  If you decide to give PAPRICA a try and run into any hitches please let me know, either by posting an issue on GitHub or emailing me directly!  Suggestions for improvement are also welcome.
