Video output from R script

As an offshoot from one of my dissertation projects I’m developing a simple model to explore protein evolution.  The model takes an amino acid sequence and mutates it, selecting for mutations that result in improvement for one of several measurable protein parameters (flexibility, hydropathy, aliphatic index, isoelectric point, or aromaticity).  I’m using it to explore how selection for one parameter results in unintended changes to the other parameters.  I implemented the model in Python but I’m using R for data analysis and plotting.  For each run I have several hundred measures of the five different protein parameters.  There are a few ways to visualize this, but since the dataset represents a temporal progression it makes sense to create an animation of the output.  After some exploration this is the method that I came up with.

Assuming that you have a table of data in R (evol_param_d_matrix_1 in this example) in which each row represents a single time point, you can generate a figure (I chose a barplot) for each time point as follows:

jpeg(filename = 'animations/Rplot%03d.jpg')
for (i in seq_len(nrow(evol_param_d_matrix_1))){
  barplot(evol_param_d_matrix_1[i,],
          ylim = c(-0.5, 1.5),
          axisnames = TRUE
          )
}
dev.off()

Note that it is extremely important to specify ylim (and xlim, if that axis might vary for your data) to prevent the scale of the plot from changing from image to image.  The above code saves each plot (one per row in evol_param_d_matrix_1) as a jpeg in the directory “animations”, labeled Rplot001.jpg, Rplot002.jpg, etc. (R numbers the files starting at 1).  To convert this sequence of still images to video I used ffmpeg, a free tool available for Linux and Mac (on a Mac, install it with Homebrew).  I called ffmpeg using:

ffmpeg -r 15 -qscale 1 -i 'Rplot%03d.jpg' test.mp4

That command tells ffmpeg to use 15 frames per second, with a quality of “1” (the highest; -qscale ranges from 1 to 31), on all jpegs matching the given naming pattern.  I first tried to use Rplot*.jpg, but this ordered the frames alphanumerically, rather than numerically.  The output file is of course test.mp4.  Changing to test.avi would instead output in that format.  One gotcha: ffmpeg writes its status messages to stderr rather than stdout, which is why neither tee nor a plain redirect captured its screen output to a logfile – appending 2> ffmpeg.log (or 2>&1 | tee ffmpeg.log) to the command does the trick, which might be useful for the initial troubleshooting.

The ffmpeg documentation is, unfortunately, rather dense and written for people with a lot more audio/visual background than I have.  It looks like ffmpeg is a commonly used tool however, so there are quite a few easy(ish) to follow tutorials online.  Here’s the final product:

ffmpeg animation of barplots produced in R.
Posted in Research

Land of Lakes

We just finished a crazy few days of sampling and I’m only now, back in Seattle, getting a moment to assimilate all of it.  The Icy Worlds team from JPL arrived late in the day on May 1, and we wasted no time getting started with field work.  The primary mission was to re-sample a number of shallow, methane-rich lakes that the team had sampled in October, just as the lakes were starting to freeze.  What we quickly discovered at the first lake we visited, however, was that a particularly cold winter and unusually thin snow cover had frozen most of the lakes straight through to the bottom.  This made it impossible to recover instruments deployed in the fall or to sample lake water.  Fortunately not all of the team’s work depended on lake water.  We recovered ice cores from several lakes and measured the flux of methane from the boreholes.

No water in this lake.  Kevin Hand and Andrew Klesh prepare to section an ice core that extends all the way to the lake sediment.

The ice from these lakes is very interesting.  As it forms it traps bubbles of methane; the ice looks like a frozen glass of champagne.  If you find a spot directly over an active methane “seep” the bubbles can be huge.  In one spot our drill broke through to a methane pocket large enough to choke out the gas powered drill motor.  We gave that hole the “lighter test” – a borehole with a high methane concentration will actually catch fire.  This one burned intensely for around five seconds!

Photo courtesy of the JPL Icy Worlds project. This photo was taken while we were working at the ice edge. I like it because it shows something I hadn’t considered before – ice algae growing on the top of rafted sea ice. Normally we think of growth as occurring only on the bottom of sea ice. There is a lot of rafted ice along pressure ridges however, perhaps we’ve been underestimating the space available for ice algae to grow.

After two days of sampling lake ice we switched things up and returned to the ice edge near where I had sampled frost flowers on April 30.  The lead was open again and the whaling crew had their boats in position for the first whales of the season.  It was extremely kind of them to allow us temporary use of their forward whaling camp – the Bowhead whales started to appear while we were there and they could have viewed our presence as a distraction or intrusion (for the whales and the whalers).  In addition to collecting more frost flowers and taking some spectacular images of the ice edge it was great to check out the whaling operation up close.  The hunters really have whale behavior dialed.  When the lead opened we were given a prediction of when the whales would arrive – which was accurate to within a couple of hours, and were told that the Beluga whales would lead the larger Bowheads by about two hours.  Not long after the Belugas started swimming past us at a leisurely pace we saw Bowheads breaching in the distance!

Lewis Brower, ready for the hunt to begin. Lewis Brower first introduced me to sea ice in 2009. Many thanks Lewis, and good luck this season! I was told that the hide boat that he is in is around 60 years old – that boat has seen a lot of whales and a lot of change to Barrow.

It’s coming right for us! A Beluga dives under the ice edge. Not long after the Belugas started passing us we could see Bowhead whales breaching in the distance.

The next day (now starting to show a little exhaustion) we returned to the lakes, this time finding some with a little water left in them.  In addition to ice and water samples the team collected some amazing footage of the under-ice lake ecosystem, including a couple of sizable whitefish shown in the video below (video courtesy of the NASA Astrobiology Institute’s Icy Worlds project at JPL).  When I see a lot of biomass like that (also represented by the sizable microbial mats in the lake) I think energy.  The interesting thing that this video illustrates to me is the amount of energy that can be contained in a shallow water column underneath a lot of ice – though of course these lakes are most energy-rich in the summer, when they are well oxygenated and full of phytoplankton fixing carbon.  In the winter, however, the flux of methane to these lakes should serve as a continued carbon source – with methanotrophic bacteria feeding small protists, which are fed on by larger lake organisms.  That’s a testable hypothesis and one that would be worth following up on.

One more day of lake sampling (and a couple of long nights in the lab) saw us to the end of the field effort.  It was one of my most memorable trips to Barrow, and a very successful sampling effort.  Many thanks to our guides at Umiaq, and to Icy Worlds for letting me tag along…

Posted in Research

Cracks in the ice

We made another effort at reaching the ice edge last night, but things have gotten a little odd out there.  About 500 meters short of the ice edge we encountered some of the whaling crew hanging out around a two-foot-wide crack in the ice.  They’d been watching the crack for some time, trying to gauge its progress.  These cracks are really significant, and can signal that a large piece of the pack ice is getting ready to head out to sea (the opposite of the rafting phenomenon that we observed on Tuesday).  Normally these cracks start to form when the wind kicks up, but there was no wind yesterday.  The hunters speculated that the current alone was starting to move the ice around.  During this exchange we spotted a polar bear moving parallel to the ice edge (the second we’ve seen in two days of sampling).

That is in fact a polar bear. All my polar bear pictures are some variation on this theme, I’m not sure how the people at National Geographic do it…

There was a quick discussion about whether we wanted to proceed across the crack and onto the potentially mobile floe, but my bear guard* and I were both curious about what was happening at the ice edge itself.  We dropped a radio with the hunting party and asked them to call us if the crack moved, rigged a measuring bar for the crack, and proceeded cautiously.  Within 200 meters we entered an entire network of cracks and realized there was no way to proceed.  We turned back to do what sampling we could at the new crack, which at least afforded an opportunity to collect seawater.

The third crack we encountered. Clearly the ice is on the move, and we are not going to reach the ice edge today!

I spent most of the day in the lab today, processing all the samples collected over the last two days.  A few hours ago the Icy Worlds team from JPL rolled in, and I’ll be helping them for the rest of the trip.  Their project is focused on the methane rich lakes that dot the tundra here (a good analogue for Europa), an environment that I have not explored yet.  I’m looking forward to it!

*In Antarctica the safety personnel request not to be named in photos, blog entries, Facebook posts, etc.  They are stuck between the rock of rigid restrictions and the hard place of scientists wanting to meet an objective, and at times are willing to bend in favor of science.  Since this can involve a creative interpretation of the rules, it’s better not to call them out.  Although this is much less of an issue in the Arctic I’m keeping this as a general policy.

Posted in Barrow 2013 field season

Breaking ice in the hot sun…

I fought the ice and the ice won?  Had a long, good day on the ice yesterday (and yes, it was quite hot).  There was a little chaos at the start, as there always is.  Barrow depends in no small part on hunting bowhead whales for food.  In addition to being an important cultural touchstone, it’s a lot more economical (and ecologically friendly) than shipping in protein from southern Alaska or the lower 48.  Whaling is possible in Barrow because the bowhead whales follow the circumpolar flaw lead that exists all around the periphery of the Arctic Ocean, at the interface between the landfast ice and pack ice.  The Barrow polynya is part of this flaw lead system – and the place I hope to sample young sea ice, frost flowers, and seawater.  Reaching the polynya, for scientists and for hunters, is not straightforward.  In a bad year several miles of extremely jumbled, broken, and sometimes shifting landfast ice has to be crossed to get there.  To enable hunting, the whaling crews start cutting trails to the lead edge in March.  The whaling captains are fiercely territorial about the trails – many of their dollars (the whaling captains are generally businessmen and civic leaders) and many hours go into building them.

Last week it looked like there was an unused trail at the north end of the polynya that we might be able to use.  This is far from the normal whaling area, and therefore our presence would have a minimal impact on whaling activity.  At the last minute, however, one of the whaling crews decided to abandon their old site and occupy this trail.  At first it wasn’t clear which whaling crew had made this move, and it looked for a while like this field effort might be dead on arrival.  As fate would have it, however, it turned out to be the crew to which the bear guard tasked to our project belonged; moreover, his dad was the crew’s captain.  The trail wasn’t quite finished yet, but the word was that they would reach the ice edge within a few hours.  If we helped them break trail they would probably look favorably on a little science work at the ice edge.  Time to break some ice.

Cutting a “pass” through one of many pressure ridges.

The basic idea with trail building is to create little passes through the pressure ridges (often 10-15 feet high) big enough for a snowmachine pulling a whaling boat (a hide boat perhaps 15 feet long and 5 feet wide).  Ice is a really satisfying medium for this kind of construction: the boulders give enough resistance to make you want to work at them, but progress is quick and visible.  After three hours on the chain gang we could see the lead edge from the top of the pressure ridge.  The whaling crew stopped for lunch and suggested we go on ahead and sample.  With a little trial and error we reached the lead edge and got to work.  There were plenty of frost flowers, though they were melting a bit under warm temps, so we made plans to visit the site again later that night when it was cooler.

The lead edge. We sampled on the small fracture in the foreground, the larger lead in the background is also covered in frost flowers.

I should know by now that nothing stays the same at the lead edge over a period of even a few hours.  When we returned to the edge around 9pm we found that the wind had shifted, “rafting” the young sea ice that we had previously sampled and covering any open water.  With the sun low on the horizon it was a beautiful sight, but bad for forming new frost flowers and for whaling.  We stayed at the lead edge for a few hours, taking pictures and sampling some of the remaining frost flowers.

Know your subject. Getting up close and personal with some frost flowers.

The lead edge after the wind shifted. Plenty of frost flowers, but now they’re high and dry on top of rafted ice floes.

Today I’m in the lab, prepping yesterday’s samples and getting ready to head back out tonight (after a heavy dose of ibuprofen).  Conditions don’t look good.  The wind hasn’t changed direction and, even worse, a light snow is falling.  We will head out anyway and see what we find.  You never can tell!

Posted in Barrow 2013 field season

Barrow or Bust!

I’ll be taking off for Barrow in just a few hours, the last field effort of my dissertation – and my last chance at collecting frost flowers!  This will be my 6th trip up in the last four years, although two of those trips were pretty short, when conditions made work impossible or impractical.  Here’s a short photo summary of our previous Barrow efforts.

April, 2009. My first time working on sea ice, and our first frost flower samples that made it into a publication (Bowman and Deming, GRL, 2010).  Many thanks to Hans-Werner Jacobi and Florent Domine for the invitation to participate in their field effort.

Our first serious, NSF funded effort to look at the microbiology of frost flowers, in February, 2010. This was a difficult trip, dominated by pretty extreme weather. There was no open water anywhere, and whenever we tried to open an artificial plot to grow frost flowers it would quickly be covered by blowing snow.  A major blizzard pinned us down towards the end of the trip, the airport was closed for days and the road from the research station to town became impassible.  At one point, with our truck stuck in a snowdrift, my lab mate Jesse and I watched snow drifts build around the truck faster than we could dig them down (we eventually abandoned the truck, fortunately we weren’t too far from the station).

Following our failed effort in February of 2010, we received an email in April that natural frost flowers were forming. I went up the next day and sampled from this lead, these are the samples in our 2013 paper (Bowman et al., EMIR, 2013). That night it snowed a foot, eradicating all frost flowers. If I’d been a day later I would have missed them.

2011, another tough year. Ice conditions that winter were extreme, and the whaling crews were having a very difficult time cutting trails through the landfast ice. I went up in March, which turned out to be much too early, and spent several frustrating days searching for a path out to the lead. We covered a lot of ground and found nothing other than these (very blurry) polar bears!

After a few days of searching I went home, and came back in April after the ice conditions had improved. This massive wall of ice shows how bad the weather was that year. It’s a pressure ridge created by the wind stacking pans of ice on top of one another, then sheared in half by a major storm. We did end up finding frost flowers, sort of. Everything was just weird that year. No new frost flowers were growing where we could reach them (the wind kept blowing them to places we couldn’t go), and the ones we could reach were very low in salinity.

So there we are… three field seasons over four years, five trips, and two publications (we are working on a third from the 2010 samples, the metagenomics project that I’ve written about here).  I’ve been nervously watching the weather for the last few days and am cautiously optimistic.  Right now it’s pretty warm (18F/-8C) and snowing lightly, not good, but tomorrow it should be cold and precipitation free.  I’d really like to find some week-old frost flowers, which isn’t going to happen with today’s snow, but the chances of finding new ones are good.

I’m really interested, however, in connecting the unexpected microbial community that we observed in 2010 to the wider environment, which means that even if I don’t find frost flowers this field season can be a success.  I’ll be sampling open water, young sea ice, and even lake water (working with colleagues on a NASA project who will arrive a few days after me) – environments that are much less dependent on conditions than frost flowers!

Posted in Research

Greenland might not be green…

But you can grow flowers there. As readers of this blog know, one of the Deming Lab’s major research directions is the microbiology of the sea ice surface – frost flowers, saline snow, and related features.  Since the sea ice surface changes over time (collecting fresh snow, among other things) the best time for a microbiological investigation is when the ice is relatively young.  Of course this presents many challenges.  In addition to being thin – and therefore difficult to travel over – the formation of young sea ice is difficult to predict.  An Arctic bay that hosted a beautiful field of frost flowers during a given month one year might be open water, or 50 cm thick ice, the next!

In 2012 Jody Deming traveled to Daneborg in northeast Greenland to take part in an exploratory field effort with colleagues in the Arctic Science Partnership (ASP).  Among other things they were trying to assess the potential of Daneborg as a long-term study site, and trying to develop methods for bringing together the tools and techniques of many different disciplines to better understand the sea ice ecosystem.  There is a polynya (a place where the ice remains thin, or even non-existent, throughout the winter) in the fjord at Daneborg, an ideal situation for studying young sea ice.  The team was thwarted, however, by weather and ice conditions.  Unable to reach the natural polynya they made their own (check out this post to see what happened when we tried this in Antarctica)!

This video from Aarhus University in Denmark has some great footage of the artificial “pond”, and a short interview with Jody, talking about some of her research interests.

Frostflower from Aarhus Universitet on Vimeo.

Posted in Research

A little Europa here in Washington

I got an email the other day from a colleague who teaches science at Soap Lake High School, a rural high school in eastern Washington that, in my opinion, punches above its weight in the sciences.  One of his students is preparing a presentation on Europa, and she had some questions for a Europa expert.  I’m not a Europa expert, but I interact with people who are, so I suppose I’m only one degree removed from actual knowledge of Europa…

Soap Lake High School is located (predictably) near Soap Lake; a highly alkaline, highly stratified saline lake.  Because of the lake’s physical structure and chemistry it is a useful analogue for the Europan ocean.  It’s not Lake Vostok, but no analogy is perfect.  Europa is a pretty hot topic in astrobiology, and there is plenty of information out there on it, but I figured it wouldn’t hurt to post my answers to her questions here.  It also lets me expand on the original answers a little.

Why do scientists think that life could be on Europa?

Scientists think that life needs two things – liquid water and a source of energy.  Europa is one of the few places in our solar system (other than Earth) where both of these things are likely abundant.  There is a lot of evidence that underneath Europa’s ice shell is a huge ocean, much larger in volume than the oceans on Earth.  Because Europa is so close to Jupiter, and because of Jupiter’s massive size, there is a huge gravitational force exerted on Europa.  This gravitational force heats Europa much in the way that Earth is heated by radioactive decay in its core, possibly enough to produce low-level volcanism at the sea floor.  As on Earth, seafloor volcanism  produces springs, black smokers, and other sources of reduced elements, useful for life as an energy source (when paired with oxidized material).  On Earth these volcanic sites are hotspots for life.

For energy, life requires both oxidized and reduced material.  Oxidized material should be abundant at the surface of Europa’s ice shell, thanks to constant bombardment by radiation.  There is lots of evidence that Europa’s ice shell turns over gradually, bringing oxidized material into the ocean and exposing fresh ice to radiation.  On Earth sea ice is a hotspot for photosynthetic life because sunlight is abundant – like black smokers it is energy rich.  A similar situation could exist on Europa at the interface between ice and water, since this environment is optimally located between the oxidized ice and the reduced ocean.  If Europa were a battery the seafloor and ice surface would be the negative and positive terminals (respectively).  Life at the interface acts as a circuit, using the electrons for biological processes as they flow from the negative to the positive terminal (from seafloor-derived material to ice surface-derived material).

From Seeking Europa’s Ocean (Pappalardo, 2010). This cutaway shows many putative features of the Europan ice shell, including the strong temperature gradient, convective diapirs, and the surface features produced by these internal processes.

How is Soap Lake similar to Europa?  How is it different?

There is a lot of “chemoautotrophic” life in Soap Lake.  This is life that, rather than using the sun as a primary energy source (in the case of plants by creating reduced carbon and free oxygen), uses reduced and oxidized chemicals – the negative and positive battery terminals discussed before.  These chemicals may have a biological source (for example ammonia or nitrate or organic carbon) or a geological source (like iron).  If the chemical being used is not organic (does not contain carbon-hydrogen bonds), life is said to live lithoautotrophically – this kind of life can theoretically happen in the absolute absence of sunlight (no photosynthesis going on anywhere in the environment).  Finding locations on Earth where this is happening is challenging, because photosynthesis influences almost everything here.  On Europa there is essentially no light that can be used in photosynthesis, so we would expect a lot of lithoautotrophy.  Without photosynthesis the ecosystem of Soap Lake would look pretty different, but there are good examples of microbial life there that can serve as a model for life on Europa.  One example is bacteria that gain energy via sulfur oxidation, where reduced sulfur compounds are oxidized by oxygen.

If there is life on Europa, what do you think it looks like?

I think it would look pretty much like life on Earth – or at least microbial life from the right habitats on Earth.  Evolution seems to produce the same solutions to problems over and over again – this idea is called convergent evolution – and the problems life would face on Europa are not so different than what it faces on Earth.  It seems likely that evolution would produce the same basic structures and metabolic strategies to deal with these problems.

Do YOU think there’s life on Europa, and what will it mean if there is?

I think Europa is the most likely place in the solar system, other than Earth, for life to exist.  The fact that it is so far from Earth (unlike Mars) makes this even more exciting.  If we find life on Europa there is almost no chance that it came from Earth, it would have to have had a separate origin.  This would tell us a lot about how widespread life is in the universe.  Finding life on Mars would be a little different, chances are that it would have something to do with life on Earth (Earth and Mars exchange asteroids and other material all the time, providing a vector for biological contamination).

How soon do you think it will be before we know for sure if there’s life on Europa?

Unfortunately it will be a long time before we know if there is life on Europa – a couple of generations, if we maintain an interest in exploring the outer solar system.  Most of what we know about Europa comes from the Galileo spacecraft, which was sent to study the entire Jupiter system.  It flew past Europa multiple times, and from those observations we know a little about the composition of the moon and the likelihood of liquid water.  We’ve never had a spacecraft in orbit around Europa, however, let alone landed on its surface!  Since life is most likely to exist below several miles of hard, thick ice, even getting to the surface won’t tell us if life exists.  Somehow we need to land on the surface and melt or drill through all that ice into the ocean below (warm ice rising in the shell might also contain life, but that would still involve a lot of drilling).  This is technically feasible but would be very, very expensive.  We might find tantalizing evidence of life from a spacecraft in orbit or on the surface, perhaps in the form of chemicals that were most likely produced by life, but this isn’t absolute proof.

Posted in Research

Some things should be easy…

******************
Updated method posted here: https://www.polarmicrobes.org/?p=859
******************

But they’re not.  Getting meaningful taxonomy from blast results is one of those things.  Here’s the situation: I ran a blastn search locally against the NCBI nt (nucleotide) database, saving the output in the (somewhat) easy to manipulate xml format.  I’m interested in catching as many of the close homologues to my queries as possible, so I allowed up to 500 matches per search.  The resulting xml file is pretty substantial.  What I want to do now is identify the genus for each of my hits.  It would be absolutely fantastic if NCBI would include the taxonomic information in the blast results – it would be easy to do on their end – but they don’t.  C’est la vie.

Solution the first (the ugly)

Since the hit definitions (the hit_def tag in the xml) contain the scientific name of the species of origin, they can be parsed with regular expressions to produce something that is probably, most of the time, the genus name.

import re

with open('big_blast.xml', 'r') as xml:
    for line in xml:
        if re.search('<Hit_def>', line) is not None:
            line = line.strip()
            line = re.sub('<Hit_def>', '', line)
            line = re.sub('</Hit_def>', '', line)
            hit_def = line.lower()
            hit_genus = re.split(' ', hit_def)[0]

But this will on occasion produce some weirdness that is not the genus name, and I doubt the exact format of hit_def is stable over time.  It would be nice to tap into the NCBI taxonomy database to get this information right from the source.

Solution the second (the fat)

The NCBI taxonomy database can be acquired from the NCBI ftp site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/.  The large gi_taxid_nucl.dmp.gz file is the key to the NCBI taxonomy safe.  It is a simple tab delimited text file mapping a gi number to a taxonomy id number.  With the taxonomy id number (and a little voodoo) you can obtain a full taxonomy for any gi that has one.  Since the gi number is conveniently embedded in the hit id section of the xml (hit_id tag, pry it out with some more regular expressions) the full taxonomy seems so, so close…
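For illustration, hit_id strings in blast xml output of this era typically look something like gi|158333233|ref|NC_009840.1| (the gi number and accession here are invented for the example).  A minimal sketch of prying the gi number out with a regular expression, assuming that format:

```python
import re

# A hit_id string in the style of older NCBI blast xml output; this
# particular gi number and accession are made up for illustration.
hit_id = 'gi|158333233|ref|NC_009840.1|'

# Capture the digits between 'gi|' and the next '|'.
match = re.search(r'gi\|(\d+)\|', hit_id)
gi = match.group(1) if match else None
```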

A casual user of Python might look at gi_taxid_nucl.dmp (with head, after using gunzip) and think dictionary.  If the whole thing can be crammed into a dictionary with the gi numbers as keys, the taxid can be easily obtained for any given gi number.  Problem is, it can’t be crammed into a dictionary – at least without a lot of trouble.  Python dictionary creation and operation is memory intensive.  24 gigs of RAM is not nearly enough for this.  I did manage to create a dictionary on a high memory machine (150 gigs, it failed when I initially set ulimit -v to 100 gigs) by iterating across each line in gi_taxid_nucl.dmp, saving the result as a pickle (a what you say?  That’s what I said too.  Turns out that pickles, in addition to being tasty fermented cucumbers, are binary flat files.  These files can store things like Python lists and dictionaries), but the resulting pickle couldn’t be utilized on the 24 gig machine.  I’m a guest on the high memory cluster and am reluctant to hog it for what seems like a simple operation, so I needed a different solution…

Solution the third (the slow)

I don’t have a lot of experience with SQL databases, but I know that this is the intended environment for the NCBI taxonomy.  The SQL workhorse mySQL looked like overkill, so I went with sqlite.  Python has a nice built-in interface (the sqlite3 module) for working with sqlite databases, so this seemed like a natural way to go.  I created a database for gi_taxid_nucl.dmp as such:

import sqlite3
conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
c.execute('''CREATE TABLE gi_taxid
(gi, taxid)''')

with open('gi_taxid_nucl.dmp', 'r') as map_file:
    for line in map_file:
        line = line.split()
        # parameterized insert; pasting values into the SQL string is fragile
        c.execute("INSERT INTO gi_taxid VALUES (?,?)", (line[0], line[1]))

conn.commit()
conn.close()

The SQL database isn’t much larger than the original tab-delimited file, and didn’t take too much memory to build. It completed in a couple of hours. To get data back out of the database I tried:

import sqlite3
conn = sqlite3.connect('gi_taxid_nucl.db')
c = conn.cursor()
t = (123,)
for row in c.execute('SELECT * FROM gi_taxid WHERE gi=?', t):
	print row

That should pull out the row corresponding to the gi number 123.  And it does.  But it takes a long time.  I didn’t time it, but I’m guessing it would take a few weeks to run tens of thousands of queries.  I’m not sure if there is a faster way to work with sqlite.  It feels like an SQL database should be the solution, but it hasn’t worked out yet…
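A likely culprit is the missing index: without one, sqlite has to scan the entire table for every query. A sketch of the fix on a small in-memory table (the index name here is my own):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE gi_taxid (gi, taxid)')
c.executemany('INSERT INTO gi_taxid VALUES (?,?)',
              [(i, i * 10) for i in range(10000)])
# the index turns each SELECT from a full table scan into a b-tree lookup
c.execute('CREATE INDEX gi_index ON gi_taxid (gi)')
conn.commit()

c.execute('SELECT taxid FROM gi_taxid WHERE gi=?', (123,))
row = c.fetchone()
print(row[0])  # -> 1230
```

On the full gi_taxid table the CREATE INDEX itself would take a while, but each subsequent query should drop from a linear scan to an effectively instant lookup.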

Solution the fourth (the least ugly, fat, and slow)

The fastest way to parse gi_taxid_nucl.dmp (that I can come up with) is with grep.  It takes grep only a second or two to match a gi number to a line in gi_taxid_nucl.dmp, which is pretty amazing.  My final script is below.  I iterate across a list of gi numbers obtained from my blast output (reformatted into a table, see this post https://www.polarmicrobes.org/?p=753), using grep to match each to a line in gi_taxid_nucl.dmp.  From these lines I can create a nice, reasonably sized dictionary from which I can quickly obtain the taxid when I need it.  The part of the script that actually generates the taxonomy comes from this post: https://groups.google.com/forum/#!msg/qiime-forum/PXem7gMltao/oKOxJ_zFD4cJ.

from cogent.parse.ncbi_taxonomy import NcbiTaxonomyFromFiles
import subprocess

subprocess.call('rm blast_taxid.txt', shell = True)

print 'building dictionary'

n = 0
with open('blast_output_table.txt', 'r') as blast_table:
    for each in blast_table:
        n = n + 1
        print n
        each = each.split('\t')
        # anchor the pattern so gi 123 can't match a line for gi 1234
        subprocess.call('grep --max-count=1 \"^'+each[3]+'[[:blank:]]\" gi_taxid_nucl.dmp | tee -a blast_taxid.txt', shell = True)

map_dict = {}           
with open('blast_taxid.txt') as gi_taxid:
    for each in gi_taxid:
        each = each.split()
        gi = each[0]
        taxid = each[1]
        map_dict[gi] = taxid
        print gi, map_dict[gi]

print 'reading in files'

tree = NcbiTaxonomyFromFiles(open('nodes.dmp'), open('names.dmp'))
root = tree.Root

print 'generating lineages'

def get_lineage(node, my_ranks):
    ranks_lookup = dict([(r,idx) for idx,r in enumerate(my_ranks)])
    lineage = [None] * len(my_ranks)
    curr = node
    while curr.Parent is not None:
        if curr.Rank in ranks_lookup:
            lineage[ranks_lookup[curr.Rank]] = curr.Name
        curr = curr.Parent
    return lineage

ranks = ['genus']
output = open('blast_lineage.txt', 'w')
error = open('error.txt','w')
with open('blast_output_table.txt', 'r') as input:
    for line in input:
        if line[0:5] != 'query':
            try:
                line = line.split('\t')
                taxid = int(map_dict[line[3]])
                node = tree.ById[taxid]
                tax = get_lineage(node, ranks)
                tax = str(tax[0]).lower()
                print >> output, line[0]+'\t'+tax+'\t'+line[2]+'\t'+line[5],
                print line[0], tax.lower()
            except KeyError:
                print >> error, line[3]
                print 'key error', line[3]

output.close()
error.close()
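An alternative to calling grep once per gi number is a single pass over gi_taxid_nucl.dmp with a set of the gi numbers you need: membership tests against a set are fast, and only the matching lines end up in the dictionary. A self-contained sketch (Python 3 syntax; the mapping file here is a tiny mock, not the real multi-gigabyte file):

```python
# write a tiny mock mapping file so the sketch runs end to end
with open('mock_gi_taxid.dmp', 'w') as f:
    f.write('11\t1117\n42\t2157\n123\t561\n999\t2\n')

wanted = set(['11', '42', '123'])  # gi numbers pulled from the blast table

map_dict = {}
with open('mock_gi_taxid.dmp') as gi_taxid:
    for line in gi_taxid:
        gi, taxid = line.split()
        if gi in wanted:
            map_dict[gi] = taxid

print(map_dict)  # three entries; gi 999 is skipped
```

One pass through the file, however long, instead of one grep invocation per query.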

I watched a show last night featuring researchers who get to dart polar bears from a helicopter, then land and give them medical examinations.  It looked like fun, I wonder if they want to switch dissertation topics…


Posted in Research | 7 Comments

Parsing blast xml output

In the previous post I described a method for efficiently blasting a medium-large dataset (33 million sequences).  One downside to using blast on a dataset this size is the very large size of the output xml file, in this case over 40 Gbytes.  The size of the output can be reduced by specifying the table output format (-outfmt 6) when executing blast, but this greatly reduces the amount of information you receive from the run.  If you are going to the trouble of running blast on 33 million sequences you may as well retain all the information possible from each alignment, reducing the chance of needing to blast the dataset again!  The xml format (-outfmt 5) is a (reasonably) user-friendly and rich blast output format.  If you are new to bioinformatics, however, it can be a little daunting – the critical information (best hit, e value, etc.) can look pretty buried in a mysterious jumble of tabs and tags.  Here’s a chunk of the aforementioned xml output, corresponding to a single search:

  <BlastOutput_iterations>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147</Iteration_query-def>
      <Iteration_query-len>70</Iteration_query-len>
      <Iteration_hits></Iteration_hits>
      <Iteration_stat>
          <Statistics_db-num>17771239</Statistics_db-num>
          <Statistics_db-len>44780205193</Statistics_db-len>
          <Statistics_hsp-len>30</Statistics_hsp-len>
          <Statistics_eff-space>1769882720920</Statistics_eff-space>
          <Statistics_kappa>0.46</Statistics_kappa>
          <Statistics_lambda>1.28</Statistics_lambda>
          <Statistics_entropy>0.85</Statistics_entropy>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>

What we have here is a nested format for storing information, identified by tags (<some_tag>), which is all that xml is.  Subordinate pieces of information are more indented than their parents, so in this example all the information belongs to BlastOutput_iterations.  If you are familiar with blast you will recognize some of the tags.  Iteration_query-def, for example, gives the name of the query (the cumbersome read id HWI-ST1035:124:D1H8TACXX:2:1101:1236:2147).  The tag Iteration_message is self-explanatory; it tells us that this iteration returned no significant matches in the database.  Had there been a match, additional tags detailing the match (or matches) would have been included.

Fortunately this use of tags makes it pretty easy to parse blast xml output using regular expressions.  I wrote a short script to convert my massive xml output into a less massive (but still a little cumbersome) tab delimited file from which I can extract specific lines corresponding to a hit using grep.  The basic strategy for parsing xml with a script is to create a variable for each tag of interest and march line by line through the xml, re-saving the variable anytime you encounter the corresponding tag.  When the outermost (least subordinate) tag of interest is encountered (or the last in a known sequence of tags, as done here), print each variable to a line in an output file.

import re

def tag_value(line):
    # pull out the text between the opening and closing tags; note that
    # str.strip('<Hit_def>') strips *characters*, not the substring, and
    # can eat the start of the value, so a regex is used instead
    return re.sub('<[^>]*>', '', line).strip()

output = open('test_blast.txt','w')
n = 0
print >> output, 'read'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e'

with open('blast.xml','r') as xml:
    for line in xml:
        if '<Iteration_query-def>' in line:
            n = n + 1
            query_def = tag_value(line)
        if 'No hits found' in line:
            print n
            print >> output, query_def+'\t'+tag_value(line)
        if '<Hit_def>' in line:
            hit_def = tag_value(line)
        if '<Hit_accession>' in line:
            hit_acc = tag_value(line)
        if '<Hsp_evalue>' in line:
            e_val = tag_value(line)
            print n
            print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val

output.close()

If, having successfully parsed the xml file, I want to access all the reads from my metagenome with a hit containing “nifH” in the definition I can extract that information relatively quickly from my table using:

grep 'nifH' test_blast.txt

Since the information that grep returns will include the query id, a new fasta containing the reads with a blast hit of “nifH” is just a short script away.  One tip… since most of the reads will not have a blast hit you can create a new, smaller table containing only those reads with hits by identifying lines with an e value.

grep 'e-' test_blast.txt > small_test_blast.txt
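And the "short script" for turning the grep results into a new fasta might look something like this (Python 3 syntax; the file names and contents here are hypothetical mocks so the sketch is self-contained):

```python
# mock parsed blast table with columns: read, hit_def, hit_acc, e
with open('test_blast.txt', 'w') as f:
    f.write('read1\tnifH gene product\tABC123\t1e-30\n')
    f.write('read2\t16S ribosomal RNA\tDEF456\t1e-10\n')

# mock fasta of the original reads
with open('reads.fasta', 'w') as f:
    f.write('>read1\nATGC\n>read2\nGGCC\n')

# collect the ids of reads whose hit definition mentions nifH
nifh_reads = set()
with open('test_blast.txt') as blast:
    for line in blast:
        fields = line.split('\t')
        if 'nifH' in fields[1]:
            nifh_reads.add(fields[0])

# write a new fasta containing only those reads
keep = False
with open('reads.fasta') as fasta, open('nifH.fasta', 'w') as out:
    for line in fasta:
        if line.startswith('>'):
            keep = line[1:].strip() in nifh_reads
        if keep:
            out.write(line)
```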
Posted in Research | 2 Comments

Mega megablast – conducting a massively parallel blast

The blast algorithm is, with good reason, one of the most widely used tools in bioinformatics.  It allows a reasonably quick comparison between a set of sequences and a (potentially very large) database, or a set of sequences in a fasta file (though this is slower than using a formatted database).  For very small datasets the NCBI maintains an online blast interface; for datasets with hundreds to thousands of sequences it is best to install the blast executable and databases locally.  A wide selection of pre-formatted databases is available on the NCBI ftp site.  But what about using blast for metagenomic analysis?  The relatively small metagenome that I’m working on has 33,000,000 sequence reads after quality control.  Blast is fast considering the vast number of calculations required to compare all of these reads against even a modestly sized database, but it would still take over a week to run this analysis using blastn (the simplest flavor of blast) against the NCBI nt (nucleotide) database on a single computation node with 12 cpus.  There is a parallel version of blast that makes use of the MPI libraries to split the analysis across multiple nodes, but compiling parallel blast looks a little daunting (I haven’t tried yet).  Fortunately there is a better solution available for users of the UW’s Hyak cluster and similar systems.

On Hyak, individual investigators, research groups, or academic units purchase computation nodes for three year deployments.  Because the computational needs of our group are limited – I’m currently the only one doing bioinformatics – we don’t own any nodes.  Fortunately the College of the Environment owns 11 nodes for trial use by investigators who might purchase nodes of their own.  They also allow these nodes to be used by students in research groups that don’t have a long term need for nodes, or for whom purchasing nodes isn’t economically feasible (thank you COE!).  This means that I have access to eleven 12-processor nodes.  As a freeloader on the system I try to tread lightly on these resources, and mostly succeed.  To blast my metagenome on this system I would need to tie up all the COE nodes for days which, in addition to being rude, would certainly jeopardize my continued use of Hyak.

For this kind of analysis the real power of Hyak comes in access to the backfill queue.  The backfill queue allows users to submit any number of jobs to any idle node in Hyak, whether they own the node or not.  This doesn’t help with memory intensive operations like assembly, but it works great on computationally intensive operations like blast.  The basic idea is to split my fasta file of 33,000,000 reads into smaller files that will each blast in less than four hours (a requirement for use of the backfill queue).  A little testing determined that 330,000 reads will blast, using megablast against the NCBI nt database and all 12 cpus available on a single node, in about an hour and a half.  Great!  I split my fasta using (script can be found here):

python split_large_multifasta.py 44_trimmed_combined_h20.fasta 100

Then I use a second Python script to create a shell script for each file, and submit that shell script to the backfill queue:

import subprocess
for i in range(0,101):
	i = str(i)
	output = open(i+'.sh', 'w')
	print >> output, '#!/bin/bash'
	print >> output, '#PBS -N "'+i+'"'
	print >> output, '#PBS -d /gscratch/coenv/bowmanjs/barrow_metagenome_working/large_blast'
	print >> output, '#PBS -l nodes=1:ppn=12,feature=12core,walltime=4:00:00'
	print >> output, '#PBS -M bowmanjs@u.washington.edu'
	print >> output, '#PBS -m abe'
	print >> output, 'cp split_fasta/44_trimmed_combined_h20.fasta_'+i+'.fasta /gscratch/coenv/bowmanjs/databases/'
	print >> output, 'cd /gscratch/coenv/bowmanjs/databases'
	print >> output, 'blastn -task megablast -max_target_seqs 5 -evalue 1e-5 -db nt -outfmt 5 -out '+i+'.xml -query 44_trimmed_combined_h20.fasta_'+i+'.fasta -num_threads 12'
	print >> output, 'mv '+i+'.xml ../barrow_metagenome_working/large_blast/'
	print >> output, 'rm 44_trimmed_combined_h20.fasta_'+i+'.fasta'
	output.close()
	subprocess.call('chmod a+x '+i+'.sh', shell = True)
	subprocess.call('qsub -q bf '+i+'.sh', shell = True)

Notice that there’s a funny bit in the above script where the query fasta file is copied to the directory with the database, and then deleted after the analysis.  This is because getting blast to look for the database anywhere other than the working directory is quite difficult (one bad quirk of blast – supposedly you can fix it in your blast profile, but this never works for me).  I find it easier to just work in the directory where I stash all my database files and then move everything out.  This isn’t optimal but it works.  At any rate, executing the above script creates and submits 100 jobs to the backfill queue.  There are some minor delays while the queuing system sorts itself out, and while waiting for idle nodes, but the whole thing is complete in about 4 hours.  Combine all the xml outputs with:

cat *.xml > combined.xml
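As an aside, the documented way to point the blast executables at a database directory is the BLASTDB environment variable (presumably the "blast profile" fix mentioned above; the path here is the database directory from the script):

```shell
# tell blast where the formatted databases live, instead of cd-ing there
export BLASTDB=/gscratch/coenv/bowmanjs/databases
echo $BLASTDB
```

In principle this lets you run blastn -db nt from any directory; your mileage, as noted above, may vary.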

Now the bad part. At this point you have a behemoth xml file with all your data, 40+ gigabytes in my case. The bits you want can be pulled out with a new script which I’ll detail in a separate post.

Posted in Research | Leave a comment