Alignment and phylogenetic inference with hmmalign and RAxML-ng

RAxML is one of the most popular programs around for phylogenetic inference via maximum likelihood. Similarly, hmmalign within HMMER 3 is a popular way to align amino acid sequences against HMMs from Pfam or created de novo. Combine the two and you have an excellent method for constructing phylogenetic trees. But gluing the two together isn’t exactly seamless, and novice users might be deterred by a couple of unexpected hurdles. Recently, I helped a student develop a workflow, which I’m posting here.

First, define some variables just to make the bash commands a bit cleaner. REF refers to the name of the Pfam hmm that we’re aligning against (Bac_rhodopsin.hmm in this case), while QUERY is the sequence file to be aligned (hop and bop gene products, plus a dinoflagellate rhodopsin as outgroup).

REF=Bac_rhodopsin
QUERY=uniprot_hop_bop_reviewed

Now, align and convert the alignment to fasta format (required by RAxML-ng).

hmmalign --amino -o $QUERY.sto $REF.hmm $QUERY.fasta
seqmagick convert $QUERY.sto $QUERY.align.fasta

Test which model is best for these data. Here we get LG+G4+F.

modeltest-ng -i $QUERY.align.fasta -d aa -p 8

Check your alignment!

raxml-ng --check --msa $QUERY.align.fasta --model LG+G4+F --prefix $QUERY

Oooh… I bet it failed. Exciting! In this case (using sequences from Uniprot) the long sequence descriptions are incompatible with RAxML-ng. Let’s do a little Python to clean that up.

from Bio import SeqIO

# RAxML-ng rejects the long Uniprot description lines, so write a copy
# of the alignment keeping only the record IDs as sequence names.
with open('uniprot_hop_bop_reviewed.align.clean.fasta', 'w') as clean_fasta:
    for record in SeqIO.parse('uniprot_hop_bop_reviewed.align.fasta', 'fasta'):
        record.description = ''  # drop everything after the ID
        SeqIO.write(record, clean_fasta, 'fasta')

Check again…

raxml-ng --check --msa $QUERY.align.clean.fasta --model LG+G4+F --prefix $QUERY

If everything is kosher go ahead and fire up your phylogenetic inference. Here I’ve limited bootstrapping to 100 trees. If you have the time/resources do more.

raxml-ng --all --msa $QUERY.align.clean.fasta --model LG+G4+F --prefix $QUERY --bs-trees 100

Superimpose the bootstrap support values on the best ML tree.

raxml-ng --support --tree $QUERY.raxml.bestTree --bs-trees $QUERY.raxml.bootstraps

And here’s our creation as rendered by Archaeopteryx. Some day I’ll create a tree that is visually appealing, but today is not that day. But you get the point.

Posted in Computer tutorials

New paper on using machine learning to predict biogeochemistry from microbial community structure

Congratulations to Avishek Dutta for his paper Machine Learning Predicts Biogeochemistry from Microbial Community Structure in a Complex Model System that was recently published in the journal Microbiology Spectrum. I’m really excited about this paper; the study it is based on inspired this perspective that I wrote for an mSystems early career special issue last year.

Summary of experimental design and analysis, from Dutta et al., 2022.

The figure above summarizes the experimental design and analysis. The experiment was designed to address the question of whether the microbial community contains sufficient information to predict a biogeochemical state in a dynamic system. The structure of a microbial community is highly sensitive to environmental change. Small changes in the chemical or physical environment will result in a shift in abundance of one or more taxa as mortality and growth rates respond. These shifts in structure are easily observed by amplicon sequencing of taxonomic marker genes. These relative abundance data can be combined with flow cytometry analysis of microbial abundance to yield absolute abundance data.
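That last step, converting relative abundances from amplicon data into absolute abundances using a flow cytometry cell count, is simple to illustrate. Here is a minimal sketch with made-up numbers; the taxa and counts are hypothetical and not from the paper:

```python
# Scale per-taxon amplicon read counts to absolute abundances using a
# total cell count from flow cytometry. Values below are hypothetical.

def absolute_abundance(read_counts, total_cells):
    """Convert read counts per taxon to cells (same units as total_cells)."""
    total_reads = sum(read_counts.values())
    return {taxon: reads / total_reads * total_cells
            for taxon, reads in read_counts.items()}

reads = {'SAR11': 600, 'Synechococcus': 300, 'Roseobacter': 100}
abs_abund = absolute_abundance(reads, 1.0e6)  # 1e6 cells per mL from FCM
```

Each taxon's share of the reads is simply applied to the total cell count, so SAR11 at 60% of reads maps to 6e5 cells per mL here.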

The trick of course is relating an observed shift in community structure to a specific biogeochemical state. Machine learning provides a number of ways to do this, but all require large training datasets. Fortunately, gene sequencing is pretty cheap these days and DNA extractions are much more high-throughput than they were just a few years ago. Because of this it’s possible to generate community structure data for hundreds of samples in relatively short order. In this study Avishek used over 700 samples from sediment bioreactors and the random forest algorithm to predict the concentration of hydrogen sulfide with a reasonably high degree of accuracy.

As with any statistical model, developing a machine learning model takes careful attention to detail. Careful segregation of the data into training and validation sets, and thoughtful engineering of the features used for prediction, yield the most honest models and the ones best suited to future predictions. Avishek’s paper is an excellent template for developing a predictive machine learning model from microbial community structure data.
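For time-ordered samples like a bioreactor time series, one simple way to keep the validation set honest is a chronological holdout rather than a random split, so that temporally autocorrelated neighbors don't leak across the train/validation boundary. A minimal sketch (the sample IDs are hypothetical, and the paper's actual validation scheme may differ):

```python
# Chronological holdout: split time-ordered samples at a cut point
# instead of shuffling, so no future information leaks into training.
# Sample IDs below are hypothetical.

def chronological_split(samples, train_frac=0.8):
    """Split an ordered list of samples into (train, validation)."""
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

samples = [f'day{d:03d}' for d in range(100)]
train, valid = chronological_split(samples)
```

The resulting training and validation sets could then be fed to any regressor, e.g. a random forest, with accuracy assessed only on the held-out later time points.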

Posted in Research

New paper on protein adaptations to high salinity and low temperature

Congratulations to Luke Piszkin (now a PhD student in the Biophysics Department at the University of Notre Dame) for the first paper in the lab to be first-authored by an undergraduate! Luke’s paper is titled Extremophile enzyme optimization for low temperature and high salinity are fundamentally incompatible and appears in the journal Extremophiles. In the paper Luke explores the molecular basis underlying the intriguing observation that there appear to be very few (no?) extreme halophiles that are also extreme psychrophiles, despite the fact that there are many environments on Earth that are both cold and salty.

Deep Lake Antarctica: cold and salty, but dominated by archaea with a surprisingly high optimal growth temperature. Image from http://www.lateralmag.com/articles/issue-7/the-cold-case-of-deep-lake with credits to Ricardo Cavicchioli.

One of these environments is Deep Lake, Antarctica, which supports a microbial community dominated by the mesophilic archaeon Halorubrum lacusprofundi (optimal growth temperature of 36 °C). That’s rather surprising given that your typical true psychrophile conks out at about 18 °C. Like all haloarchaea, what H. lacusprofundi can do is tolerate high levels of salt, up to 4.5 M NaCl or 262 g L-1. That level of salt tolerance is not seen among the documented true psychrophiles. Why not?

In the manuscript we posit that it comes down to the different amino acid substitutions needed to adapt a protein to high salt or low temperature conditions. High salt proteins typically have low isoelectric points, derived from more acidic amino acids. The practical implication of this is that they have a more negatively charged surface that requires a high concentration of salt for stability. This is a requirement for the “salt-in” strategists that dominate the most saline environments (such as salt crystallizer ponds). These microbes are primarily archaea but include a few bacteria, and deal with the high salinity of their environment by accumulating high intracellular concentrations of the salt KCl. This maintains their osmotic balance while excluding more harmful salts, but requires proteins that are compatible with high concentrations of KCl. By contrast most halotolerant bacteria (including psychrophiles that inhabit moderate salinity environments) are “salt-out” strategists that accumulate organic solutes to maintain osmotic balance. These solutes impose no particular requirements on intracellular proteins.

The trick is that amino acid substitutions that lead to a lower isoelectric point also decrease the flexibility of the protein. Increased flexibility is the key protein adaptation to low temperature. Thus the fundamental incompatibility between optimization to low temperature and high salinity. To test this idea Luke dusted off a model, the Protein Evolution Parameter Calculator (PEPC), that I developed many years ago in the waning days of my PhD. After updating the code from Python 2 to Python 3 and making some other improvements, Luke devised an experiment to “evolve” core haloarchaea orthologous group (tucHOG) proteins from H. lacusprofundi and the related mesophile Halorubrum salinarum. By telling the model to select for increased flexibility or decreased isoelectric point he could identify how improvements in one parameter impacted the other. As expected, likely amino acid substitutions (based on position in the protein and the BLOSUM80 substitution matrix) that increased flexibility also strongly favored an increased isoelectric point.
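The isoelectric point itself is straightforward to estimate from sequence: find the pH at which the summed Henderson-Hasselbalch charges of the protein's ionizable groups cancel. The sketch below uses one common set of approximate pKa values (PEPC's actual implementation may differ); note how acidic residues (D/E), the hallmark of halophilic proteins, pull the pI down:

```python
# Estimate a protein's isoelectric point (pI): the pH at which the
# summed charges of its ionizable groups cancel. The pKa values below
# are one common approximate set; PEPC's internals may differ.

PKA_POS = {'K': 10.8, 'R': 12.5, 'H': 6.5}           # basic side chains
PKA_NEG = {'D': 3.9, 'E': 4.1, 'C': 8.5, 'Y': 10.1}  # acidic side chains
PKA_NTERM, PKA_CTERM = 8.6, 3.6

def net_charge(seq, ph):
    """Henderson-Hasselbalch net charge of the peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))   # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))  # free C-terminus
    for aa in seq:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(seq):
    """Bisect for the pH where net charge crosses zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > 1e-4:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

An acid-rich toy peptide like 'DDEE' lands at a low pI while a basic one like 'KKRR' lands well above neutral, which is the axis along which the salt-in strategists have evolved.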

From Piszkin and Bowman, 2022. The directed evolution of tucHOG proteins from H. lacusprofundi and H. salinarum. The proteins were forced to evolve toward increasing flexibility while monitoring the resulting change in isoelectric point.
Posted in Research

New paper on detecting successful mitigation of sulfide production

Congrats to Avishek Dutta for his new paper “Detection of sulfate-reducing bacteria as an indicator for successful mitigation of sulfide production” currently available as an early view in Applied and Environmental Microbiology. This was intended to be the second of two papers on a complex experiment that we participated in with BP Biosciences, but the trials and tribulations of peer review led this to be the first. We’re pretty excited about it.

Here’s the quick background. When microbes run out of oxygen the community turns to alternate electron acceptors through anaerobic respiration. One of these is sulfate, which anaerobic respiration reduces to hydrogen sulfide. In addition to smelling bad, hydrogen sulfide is pretty reactive and forms sulfuric acid when dissolved in water. For industrial processes this is a problem. Sulfide can destroy products, inhibit desired reactions, and corrode pipes and equipment. To make matters worse, sulfate-reducing bacteria (SRBs: those microbes that are capable of using sulfate as an alternate electron acceptor) can form tough biofilms that are hard to dislodge.

One way of dealing with undesired SRBs is to fight biology with biology and add a more favorable electron acceptor. Oxygen would of course work really well, but it typically isn’t feasible to implement oxygen injection on a really large scale. However, nitrate also works well. If nitrate is abundant, nitrate-reducing bacteria (NRBs) will outcompete SRBs for resources (e.g., labile carbon). Great! Now here’s the challenge… adding massive quantities of nitrate salts is expensive and likely has its own ecological and environmental consequences. So we’d like to do this judiciously, adding just enough nitrate to the system to offset sulfate reduction. But how do you know when you’ve added enough? In a really big system (like an oil field) the sulfide production can be happening very far from any possible sampling site, so simply measuring the concentration of hydrogen sulfide doesn’t help much. But we can learn some useful things by monitoring the microbial community in the effluent.

Schematic of biofilm dispersal, leading to a recognizable signal in the effluent. From Dutta et al., 2021.

The figure above is a schematic of the formation and decay of the biofilm before, during, and after mitigation. In our study the biofilm was presumed to be sulfidogenic and the mitigation strategy was addition of nitrate salts, but the concept applies equally well to any biofilm and any mitigation strategy. The trick – and this is one of those things that seems painfully obvious after the fact but not before – is that you’re looking for the thing you’re mitigating to appear in the effluent. Although this might seem to suggest increased abundance in the system, it actually represents decay of the biofilm and loss from the system. To take this a step further we used paprica to predict genes in the effluent and then identified anomalies in the abundance of genes involved in sulfate reduction. These anomalies provide specific markers of successful mitigation and point toward a general strategy for monitoring the effectiveness of mitigation.
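The anomaly idea can be illustrated with a simple rolling z-score: flag any time point at which a predicted gene's abundance deviates strongly from its recent history. This is just a sketch with hypothetical counts for a sulfate reduction gene, not the detection method used in the paper:

```python
# Flag time points where a value sits more than `thresh` standard
# deviations from the mean of the preceding `window` observations.
# The dsrA counts below are hypothetical.

from statistics import mean, stdev

def anomalies(series, window=5, thresh=2.0):
    flagged = []
    for i in range(window, len(series)):
        ref = series[i - window:i]          # trailing window of history
        mu, sd = mean(ref), stdev(ref)
        if sd > 0 and abs(series[i] - mu) / sd > thresh:
            flagged.append(i)
    return flagged

dsrA = [10, 11, 9, 10, 10, 11, 10, 40, 10, 9]  # spike at index 7
spikes = anomalies(dsrA)
```

The spike at index 7 stands far outside its trailing window and gets flagged; in the mitigation context such a spike in sulfate reduction genes in the effluent would mark biofilm dispersal.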

The detection of anomalies in the predicted abundance of relevant genes provides a way to detect the successful mitigation of SRBs (or any biofilm forming microbes). From Dutta et al., 2021.
Posted in Research

New paper connecting aerosol optical depth to sea ice cover and ocean color

Congratulations to Srishti Dasarathy for her first first-authored publication! Srishti’s paper “Multi-year Seasonal Trends in Sea Ice, Chlorophyll Concentration, and Marine Aerosol Optical Depth in the Bellingshausen Sea” is out in advance of print in JGR Atmospheres. This paper was a really long time in coming. For this study, Srishti made use of several different satellite products including measurements of marine aerosol optical depth (MAOD) derived from the CALIPSO satellite. We are not a remote sensing lab and Srishti doesn’t come from a remote sensing or physics background, so the learning curve was pretty steep. It took a couple of years, a lot of Matlab tutorials, and an internship with the CALIPSO team at NASA’s Langley Research Center just to crack the CALIPSO data and start testing hypotheses. Srishti’s main hypothesis was that MAOD would be positively correlated with ocean color and negatively correlated with sea ice, since phytoplankton are known to be a source of volatile organic compounds that can form aerosol particles. Confounding this is that sea spray – which like phytoplankton is associated with open water periods – is also a source of aerosols.

The CALIPSO satellite “curtain”. Figure taken from https://www.globe.gov/web/s-cool/home/satellite-comparison/how-to-read-a-calipso-satellite-match.

One challenge that we faced was that CALIPSO represents data with high spatial resolution along a 2D path or “curtain”, as shown above. The orbital geometry is such that not every point on the globe gets covered; the same curtains get sampled every 16 days. Thus, while spatial resolution is high along the curtain, it is poor orthogonal to the curtain, and temporal resolution is limited to 16 days. This makes it a bit challenging to capture signals associated with relatively ephemeral events (such as phytoplankton blooms).

Basin-scale averages of MAOD, chlorophyll a, ice cover, and wind speed. From Dasarathy et al. 2021.

To work around these limitations Srishti took a basin-scale view of the CALIPSO data and looked for large-scale trends that would link MAOD with chlorophyll a or ice cover. This approach isn’t ideal and glosses over a lot of interesting details, but it is nonetheless sufficient to reveal some interesting relationships. Most notably, MAOD and chlorophyll a are weakly but significantly correlated in a time-lagged fashion, with a delay of approximately 1 month yielding the strongest correlation. This makes sense, as the volatile organic compounds that link phytoplankton (and ice algal) communities to MAOD are thought to be maximally produced near the end of the phytoplankton bloom as the biomass starts to decay. In the near future new satellite missions like PACE and improved land/sea observing campaigns will allow us to get into the details a bit more, including direct observations of specific blooms and the time- and space-lagged MAOD response!
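The time-lagged correlation analysis boils down to correlating one series against a shifted copy of the other. A toy sketch follows; the series are invented (MAOD is just chlorophyll delayed by one step) purely to show that the lagged correlation peaks when the shift matches the built-in delay:

```python
# Lagged Pearson correlation: correlate x[t] against y[t + lag].
# The toy series below are invented, not the study's monthly averages.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def lagged_corr(x, y, lag):
    """Correlation of x with y shifted earlier by `lag` steps."""
    return pearson(x[:len(x) - lag], y[lag:]) if lag else pearson(x, y)

chl = [1, 3, 7, 4, 2, 1, 1, 2]    # hypothetical chlorophyll bloom
maod = [0, 1, 3, 7, 4, 2, 1, 1]   # hypothetical MAOD, lagging by one step
```

With these toy series the lag-1 correlation is perfect while the lag-0 correlation is much weaker, the same qualitative pattern the paper reports at a ~1 month lag.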

The strength and sign of the correlation between MAOD and sea ice cover, wind speed, and chlorophyll a change as a function of the time-lag. For chlorophyll a, the strongest correlation with MAOD is observed with a 1-month lag. We hypothesize that this corresponds to the decay of a phytoplankton bloom when we expect the emissions of volatile organic carbon compounds to be maximal.
Posted in Research

Sampling mangroves in Florida’s Indian River Lagoon

Last week PhD student Natalia Erazo and I were fortunate to get back into the field after a long pandemic hiatus.  Our mission was to collect mangrove propagules (essentially a detachable bud from which the mangrove seedling sprouts) from the Indian River Lagoon in Florida for an upcoming experiment on mangrove-microbe symbiosis.  Neither of us had worked in Florida before so we teamed up with Candy Feller, an emeritus scientist with the Smithsonian Marine Station in Ft. Pierce, FL.  Candy has been working on mangroves in Florida and around the world for decades and is extremely knowledgeable about the ecology of these systems.  She and husband Ray Feller allowed us to tag along as they checked on a few long-term experiments and study sites up and down the coast.

Natalia, Candy, and I standing in a mixed salt marsh-mangrove habitat near the northern limit of the mangrove range. Photo: Ray Feller.

For those not familiar with Florida’s Atlantic coast, the Indian River Lagoon is a network of estuaries and barrier islands that stretches from north of Cape Canaveral to south of Port St. Lucie. The barrier islands form a protected waterway that provides habitat for mangroves, manatees, and a variety of other species. The Indian River Lagoon is home to quite a few people as well, and there are some issues associated with water quality. Nutrients from septic and sewage systems are cited as a cause of high phytoplankton loads and increasingly murky water, leading to a reduction in aquatic vegetation and increased manatee mortality. Key landscape features in the Lagoon are also the result of human habitation. For example, much of the mangrove habitat in the Ft. Pierce region exists within engineered mosquito abatement areas. To reduce the number of mosquitos (now mainly a nuisance, but historically some carried disease), berms were created around vast tracts of mangrove habitat. These areas were then flooded, reducing the breeding success of mosquitos because they lay eggs on wet but not flooded soil.

Natalia samples propagules from mangroves of the genus Avicennia in a former mosquito abatement area.

Unfortunately, mosquito abatement also killed the mangrove trees which, while salt tolerant and adapted to life in saturated soils, require tidal action to oxygenate the water.  Modern mosquito abatement efforts (while still energy and labor intensive) take this into account and mangroves are thriving in areas that were formerly stagnant abatement ponds.  This is a Good Thing for anyone who likes fish, crabs, shoreline stabilization, and any of the other services that mangroves are well known for providing.

A particularly interesting feature of the Indian River Lagoon is that it is oriented north to south at nearly the northernmost known extent of mangroves on the US Atlantic Coast. This provides an excellent opportunity to study how mangroves are responding to a changing climate. It’s known that mangroves are extending their range to the north, but climate change is anything but linear, and the rise in atmospheric and sea surface temperatures is accompanied by instabilities and severe perturbations. The most notable may be freezing events caused by deep intrusions of the now infamous polar vortex. Such perturbations can have a bigger impact on landscape ecology than the background climate. Mangroves are very much tropical species but are somewhat resistant to transient freeze events (at least more so than your average Florida orange tree). How they respond physiologically to these and other stressors that they encounter in their northward progression remains to be seen.

Mangrove trees near the southern end of the Indian River Lagoon. There are no salt marsh habitats in the region, mangrove forests dominate the estuaries.
Salt marsh (with pulp mill in the background) at Fernandina Beach, well north of the current known mangrove range in Florida. Eventually this salt marsh will convert to mangrove forest similar to the previous picture, but the timeline on which this will occur is anyone’s guess.
Posted in Research

New paper on microbial community structure in coastal Southern California

Congrats to postdoctoral researcher Jesse Wilson for his new paper in Environmental Microbiology, Recurrent microbial community types driven by nearshore and seasonal processes in coastal Southern California. Although considerable microbiology work has taken place at the Ellen Browning Scripps Pier this is (surprisingly) the first study to comprehensively look at how bacterial and archaeal community structure change over time. This is also the first of what we hope to be many publications that are a product of the Scripps Ecological Observatory.

Jesse Wilson (left), Avishek Dutta (right), and I prep an in situ sampling pump for the Scripps Ecological Observatory.

As part of the Scripps Ecological Observatory effort we team up with the Southern California Coastal Ocean Observing System (SCCOOS) team for twice-weekly sampling of surface water for microbial community structure via 16S and 18S rRNA gene sequencing and microbial abundance via flow cytometry. As you can see from the SCCOOS and flow cytometry data below it’s a pretty dynamic system! This is why the site is so advantageous for ecological studies; more dynamic means more opportunities to identify co-variants in the environment that signal possible interactions.

From Wilson et al., 2021. Key ecological parameters and flow cytometry data for the Ellen Browning Scripps Pier for an ~18 month period.

At the core of Jesse’s paper is the 16S rRNA gene sequence dataset. What these data provide is a high resolution view of the taxonomy of the bacterial and archaeal community at each sample point. These data are so high resolution – after proper denoising and quality control they represent hundreds to thousands of unique taxa – that it’s often difficult to make inferences from them. Techniques are applied to reduce the complexity of the data and make it easier to see patterns.

From Wilson et al., 2021. Two different techniques were applied to the 16S rRNA gene dataset to reduce the complexity of the microbial community and allow patterns to emerge. The panel at the top shows the occurrence of taxonomic “modes” (our term for SOM-derived classes). The panel at the bottom shows the occurrence of subnetworks in a WGCNA analysis.

Jesse approached the problem from the perspectives of both the observations (sampling days) and the variables (microbial taxa). For microbial time-series data it is much more common to aggregate variables. A widely used approach involves a technique known as weighted gene correlation network analysis (WGCNA), originally developed for gene expression studies. WGCNA uses network analysis to combine taxa into subnetworks or modules that have similar co-occurrence patterns. One advantage of this approach is that the subnetworks are easily correlated to external variables that either drive the pattern (e.g., physical processes) or are influenced by it (e.g., ecophysiology). A disadvantage is that these correlations aren’t predictive. You can’t readily classify new data into the existing subnetworks, and the co-occurrence patterns of the subnetworks themselves contain additional information that isn’t readily captured by this approach.

In a 2017 paper we demonstrated how self-organizing maps (SOMs) can be used to more explicitly link environmental parameters with microbial community structure. SOMs are a form of neural network and collapse complex, multi-dimensional data into a 2D representation that retains the major relationships present in the original data. The end result of the SOM training process is a 2D model of the data that can be further subdivided into distinct classes. Applied to community structure data (i.e. in microbial community segmentation) the SOM flips the aggregation problem, aggregating samples instead of taxa. That means that each unique sample point can be described by the model as a single discrete variable that nonetheless captures much of the key information present. A major advantage to this approach is that the model is reusable: new data can be very efficiently assigned to existing classes, which is a key advantage for an ongoing ecological monitoring effort.
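That reusability comes from how SOM classification works: a new sample is assigned to its best matching unit, the node whose weight vector it sits closest to, and inherits that node's class. A toy sketch with a three-node "map" (a real SOM uses a 2D grid of many nodes; the community compositions and class labels here are hypothetical):

```python
# Classify a new sample by its best matching unit (BMU): the trained
# node whose weight vector is nearest in Euclidean distance.
# Node weights and class labels below are hypothetical.

def best_matching_unit(sample, nodes):
    """Index of the node closest to the sample."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(range(len(nodes)), key=lambda i: dist(sample, nodes[i]))

nodes = [(0.9, 0.1, 0.0),   # node 0: taxon-A dominated community
         (0.1, 0.8, 0.1),   # node 1: taxon-B dominated community
         (0.2, 0.2, 0.6)]   # node 2: mixed community
node_class = {0: 'spring', 1: 'summer', 2: 'winter'}

new_sample = (0.85, 0.10, 0.05)
cls = node_class[best_matching_unit(new_sample, nodes)]
```

Because classification is just a nearest-node lookup, assigning each incoming time-series sample to an existing class is essentially free, which is what makes the trained model reusable for ongoing monitoring.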

Results of a “microbial community segmentation” using SOMs. A graphical representation of the model is shown in A. B-E show the association of the microbial modes with different ecological parameters.

This paper is an exciting but very early effort to track microbial processes at the Scripps Ecological Observatory. The time-series presented here ends in June of 2019 – the date our original (and terrible) flow cytometer terminally failed – but twice-weekly data collection has continued. We now have three years of 16S and 18S rRNA gene sequence and flow cytometry data and this collection will continue as long as we’re able to support it! Students and potential postdocs interested in microbial time-series analysis should take note…

Many thanks to the Simons Foundation Early Career Investigator in Marine Microbial Ecology and Evolution program for supporting this work, and to all the SCCOOS technicians and Bowman Lab personnel for bringing us water and processing samples!

Posted in Research, Scripps Ecological Observatory

New paper on microbial life in hypersaline environments

Congrats to Benjamin Klempay for his first, first-authored publication in the lab! (wow, didn’t I just write that??) Benjamin is part of the Oceans Across Space and Time (OAST) project and his paper, Microbial diversity and activity in Southern California salterns and bitterns: analogues for ancient ocean worlds, appears in a special issue of the journal Environmental Microbiology. In the paper Benjamin does a deep dive into the microbial diversity of the network of lakes that make up the South Bay Salt Works, a little known industrial site/wildlife refuge on San Diego Bay that also happens to be the oldest continually operating solar salt harvesting facility in the US.

OAST team members Maggie Weng, Benjamin Klempay, and Peter Doran at the SBSW in 2020.

Our interest in hypersaline lakes – aside from the fact that they’re just really weird and fun environments to explore – is their value as analogues for evaporative environments on Mars and other ancient ocean worlds. Once upon a time Mars was wet, and may not have been so dissimilar to many environments on Earth today. As that water was lost the oceans, lakes, and wetlands were reduced by evaporation to saline lakes and ultimately salt pans. These end-state evaporative environments are key targets for Martian exploration today. Extremely salty lakes like those found at the Salt Works are a reasonable representation of the last potentially inhabited environments on the surface of Mars before it became too desiccated to support life. Thus the signatures of ancient Martian life might bear some similarities to contemporary life in these lakes.

From https://www.nasa.gov/press/2015/march/nasa-research-suggests-mars-once-had-more-water-than-earth-s-arctic-ocean. Mars was once a wet world. As it dried the remnant lakes and oceans would have become increasingly saline, eventually representing hypersaline environments like the lakes of the South Bay Salt Works.

The microbial diversity of hypersaline lakes has been studied in depth – as I mentioned before, they’re weird and fun places to study – but Benjamin’s work looks at a couple of unexplored elements. First, he didn’t restrict his analysis to the sodium chloride lakes at the Salt Works (salterns) but also included magnesium chloride lakes (bitterns) that are thought to be too toxic for life (see a nice discussion of this in a recent OAST paper here). He found an interesting pattern of microbial diversity across these lakes, with diversity decreasing as salinity increases, then suddenly increasing in the magnesium chloride lakes. The reason for this is the absence of microbial growth in those lakes. Rather than hosting a specialized microbial community they collect microbes from dust, sea spray, and other sources (infall), and preserve this DNA by inactivating the enzymes that would normally degrade it.
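The diversity pattern can be made concrete with the Shannon index, which rewards both the number of taxa and the evenness of their abundances. A toy sketch with hypothetical counts: a saltern dominated by one haloarchaeon versus a bittern holding an even mix of preserved infall DNA:

```python
# Shannon diversity (H') from taxon counts: higher H' means more taxa
# and/or more even abundances. The count vectors are hypothetical.

from math import log

def shannon(counts):
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

saltern = [900, 50, 30, 20]          # dominated by a single taxon
bittern = [100, 90, 95, 105, 110]    # even mix of preserved infall DNA
```

Even without any growth, the evenly mixed infall DNA of the bittern scores a higher H' than the specialist-dominated saltern, which is exactly the counterintuitive uptick in diversity seen in the magnesium chloride lakes.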

Microbial diversity in salterns and bitterns. Diversity increases below the known water activity limit for bacteria and archaea due to external inputs of new genetic material. From Klempay et al. 2021.

Co-authors Anne Dekas and Nestor Arandia-Gorostidi at Stanford also applied nano-SIMS to evaluate single-cell activity levels across the salinity (water activity) gradient. Biomass can be very high in these lakes – 100-fold or more higher than in seawater – so we assumed that activity would be high too. The nice thing about nano-SIMS is that it evaluates activity on a per-cell basis. Looked at in this way, most bacteria and archaea had surprisingly low levels of activity. We’re still trying to understand exactly what this means and Anne and Nestor undertook an impressive array of experiments as part of our 2020 field effort to try to get to the bottom of it. We think that the extraordinarily low levels of predation are partially responsible; the eukaryotic protists that typically prey on bacteria and archaea can’t grow at the salinity of the saltiest lakes at South Bay Salt Works. Viruses, the other major source of mortality for bacteria and archaea, don’t generally propagate through low-activity populations. So the haloarchaea that dominate in these lakes may have hit upon a winning evolutionary strategy of slow growth under the protection of a particularly extreme environment.

Single-cell activities as measured by nano-SIMS. From Klempay et al. 2021.
Posted in OAST, Research

New paper on shrimp aquaculture in mangrove forests

Congrats to Natalia Erazo for her first first-authored publication in the lab! Her paper, Sensitivity of the mangrove-estuarine microbial community to aquaculture effluent, appears in a special issue of the journal iScience. The publication is the culmination of our 2017 field effort in the Cayapas-Mataje and Muisne regions of Ecuador.

Study sites in Cayapas-Mataje and Muisne, Ecuador. From Erazo and Bowman, 2021.

Ecuador is ground zero for mangrove deforestation for shrimp aquaculture. Most of Ecuador’s coastline is in fact completely stripped of mangroves. The biogeochemical consequences of this aren’t hard to imagine. Mangrove forests contain a significant amount of carbon in living biomass and in the sediment. Aquaculture ponds, by contrast, contain a large amount of nitrogen as a result of copious additions of nitrogen-rich shrimp feed. The balance of C to N is one of the fundamental stoichiometric relationships in aquatic chemistry. When it shifts all kinds of interesting things start to happen.

Shrimp aquaculture ponds in Muisne, Ecuador. Once there were mangroves…

The one place in Ecuador where you can find large areas of mangroves is the Cayapas-Mataje Ecological Reserve. CMER is in fact the largest contiguous mangrove forest on the Pacific coast of Latin America. Its status comes from an interesting combination of social and economic factors that left this part of Ecuador relatively undeveloped until recently. There is shrimp aquaculture in the reserve, but it’s nowhere near as expansive as in Muisne and other ex-mangrove sites in Ecuador.

Natalia leveraged the different levels of disturbance present in Cayapas-Mataje, and between Cayapas-Mataje and Muisne, to explore what the impact of all this aquaculture activity is on microbial community structure. After all it’s really the microbial community that responds to and drives the biogeochemistry, so understanding the sensitivity of these communities to the changing conditions gives us insight into how the system is changing as a whole.

Patterns in biogeochemistry and genomic features across the disturbance gradient in this study. Erazo and Bowman, 2021.

By using our paprica pipeline Natalia was able to evaluate changes in microbial community structure, predicted genomic content, and key genome features across the disturbance gradient. A nitrogen excess (relative to phosphorous) was associated with bacteria with larger genomes and more 16S rRNA gene copies, indicative of a more copiotrophic or fast-growing population. This has implications for how carbon is turned over or retained at the higher levels of disturbance.

Distribution of predicted metabolic pathways related to nitrogen cycling across different levels of disturbance. Erazo and Bowman, 2021.

Different microbial metabolisms are also associated with the level of disturbance. The figure above shows the distribution of predicted metabolic pathways associated with nitrogen metabolism. Nitrogen fixation, a feature of microbial symbionts of many plants, is less abundant at high levels of disturbance, while pathways associated with denitrification are more abundant. The interesting thing about this is that these samples are restricted to the mangroves themselves – the high disturbance samples don’t reflect the actual aquaculture ponds – so these changes reflect altered processes in the remaining stands of mangroves. The loss of beneficial, symbiotic bacteria and elevated abundance of putative shellfish pathogens suggests the impacts of aquaculture are not limited to the physical removal of mangrove trees and associated release of carbon.


A short tutorial on Gnu Parallel

This post comes from Luke Piszkin, an undergraduate researcher in the Bowman Lab. Gnu Parallel is a must-have utility for anyone who spends a lot of time in Linux Land, and Luke recently had to gain some Gnu Parallel fluency for his project. Enjoy!

*******

GNU parallel is a Linux shell tool for executing jobs in parallel using multiple CPU cores. This is a quick tutorial for speeding up your workflow and getting the most out of your machine with parallel.

You can find the current distribution here: https://www.gnu.org/software/parallel/. Please try some basic commands to make sure it is working.

You will need some basic understanding of “piping” in the command line. I will describe command pipes briefly just for our purposes, but for a more detailed look please see https://www.howtogeek.com/438882/how-to-use-pipes-on-linux/.

Piping data in the command line involves taking the output of one command and using it as the input for another. A basic example looks like this:

command_1 | command_2 | command_3 | … 

Here the output of command_1 is used as the input for command_2, the output of command_2 as the input for command_3, and so on. For now we will only need a single pipe with parallel. Now let’s look at a basic command run in parallel.
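First, though, here is a concrete plain pipe built only from standard tools, to make the mechanics clear (the sentence is just dummy text):

```shell
# tr splits the text into one word per line, sort groups duplicate
# words together, uniq collapses the duplicates, and wc -l counts
# the lines that remain, i.e. the number of unique words.
printf 'the cat sat on the mat\n' | tr ' ' '\n' | sort | uniq | wc -l
# prints 5 ("the" appears twice but is counted once)
```

Each stage only ever sees the previous stage’s output, which is what makes pipes so easy to compose.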

Input: find -type f -name "*.txt" | parallel cat
Output: 
The house stood on a slight rise just on the edge of the village.
It stood on its own and looked over a broad spread of West Country farmland.
Not a remarkable house by any means - it was about thirty years old, squattish, squarish, made of brick, and had four windows set in the front of a size and proportion which more or less exactly failed to please the eye.
The only person for whom the house was in any way special was Arthur Dent, and that was only because it happened to be the one he lived in.
He had lived in it for about three years, ever since he had moved out of London because it made him nervous and irritable

This command uses find to list all the .txt files in my directory, then runs cat on each of them in parallel, printing the contents of each file. We can already see how this is much easier than running each command separately, i.e.:

In: cat file1.txt
The house stood on a slight rise just on the edge of the village.
In: cat file2.txt
It stood on its own and looked over a broad spread of West Country farmland.

Also, notice that we don’t need any placeholder for the files in the second command: by default, parallel appends each input line as the last argument of the command it runs. Now let’s take a more complicated example:
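Before that, it’s worth seeing the default-argument behavior in isolation; in this sketch, seq and echo are just stand-ins for a real file list and command:

```shell
# parallel appends each input line as the last argument of the
# command, so the explicit {} placeholder is optional here.
seq 3 | parallel echo got        # got 1, got 2, got 3 (order may vary)
seq 3 | parallel echo got {}     # the same, with the placeholder written out
seq 3 | parallel -k echo got {}  # -k additionally keeps output in input order
```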

find -type f -name "*beta_gal_vibrio_vulnificus_1_100000_0__H_flex=up_*.txt" ! -name "*tally*" | parallel -j 4 python3 PEPCplots.py {} flex log
0.001759374417007663, 0.00033497120199255527, 0.9969940359705531
0.0019773468515624356, 0.00022978867370935437, 0.9969940359705531
0.001332602651915014, 0.0005953339816183529, 0.9969940359705531
0.0015118302435556904, 0.0005040931537659636, 0.9969940359705531
0.001320879258211107, 0.0006907926578169569, 0.9969940359705531
0.0016753759966792244, 0.00041583739269117386, 0.9969940359705302
0.0017187095827331082, 0.00036931151058880094, 0.9969940359705531
0.0017045099726521733, 0.00031386214441070197, 0.9969940359705531
0.001399703145023273, 0.0005196629341168314, 0.9969940359705531
0.001436129272321403, 0.0004806654291442482, 0.9969940359705531

This is an example from my research: it takes in a .txt data file and spits out some parameters that I want to put in a spreadsheet. Like before, we use find to get a list of all the files we want the second command to process. We use ! -name "*tally*" to exclude any files that have “tally” anywhere in the name, because we don’t want to process those.

In the second command we have the option -j 4. This tells parallel to use 4 CPU cores, so it can run 4 commands at a time. You can check your computer specs to see how many cores you have available. If your machine has hyper-threading, it can create virtual cores to run jobs on too. For instance, my dinky laptop only has 2 physical cores, but with hyper-threading I can use 4. This is another way to improve your efficiency.

The second command also contains a {} placeholder, which is filled by each line of the first command’s output. We need the explicit placeholder here because the input file goes between other arguments rather than at the end of the command.

You can also use parallel to run a number of identical commands at the same time. This is helpful if you have a program to run on the same file multiple times. For example:

seq 10 | parallel -N0 cat file1.txt
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.
The house stood on a slight rise just on the edge of the village.

Here we use seq as a counting mechanism for how many times to run the second command; you can adjust the number of jobs by changing the seq argument. The -N0 flag tells parallel to ignore the piped input, because this time we aren’t using the first command’s output as arguments. Often I like to include both the time shell tool and the --progress parallel option to see the current job status and time to completion:

seq 10 | time parallel --progress -N0 cat file1.txt
Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:4/0/100%/0.0s The house stood on a slight rise just on the edge of the village.
local:4/1/100%/1.0s The house stood on a slight rise just on the edge of the village.
local:4/2/100%/0.5s The house stood on a slight rise just on the edge of the village.
local:4/3/100%/0.3s The house stood on a slight rise just on the edge of the village.
local:4/4/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:4/5/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:4/6/100%/0.2s The house stood on a slight rise just on the edge of the village.
local:3/7/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:2/8/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:1/9/100%/0.1s The house stood on a slight rise just on the edge of the village.
local:0/10/100%/0.1s
0.21user 0.46system 0:00.63elapsed 108%CPU (0avgtext+0avgdata 15636maxresident)k
0inputs+0outputs (0major+12089minor)pagefaults 0swaps
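A closing note on choosing -j. This is just a sketch: nproc ships with GNU coreutils, and the .log files and the gzip command are illustrative stand-ins for your own data and program:

```shell
# nproc reports how many cores (hyper-threaded virtual cores
# included) are available, which is a sensible starting value for -j.
nproc
# Compress every .log file, running up to 2 gzip jobs at a time;
# --keep leaves the originals in place next to the .gz copies.
find . -type f -name "*.log" | parallel -j 2 gzip --keep {}
```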

And with that, you are well on your way to significantly increasing your computing throughput and using the full potential of your machine. You should now have a sufficient understanding of parallel to construct commands for your own projects, and to explore more complicated applications of parallelization. (Bonus points to whoever knows the book that I used for the text files.)
