Tutorial: How to make a map using QGIS

Hi! I’m Natalia Erazo, currently working on the Ecuador project aimed at examining biogeochemical processes in mangrove forests. In this tutorial, we’ll learn the basics of (free) QGIS, how to import vector data, and how to make a map using data obtained from our recent field trip to the Cayapas Mataje Ecological Reserve in Ecuador!  We’ll also cover standard map elements and use the QGIS Print Composer to generate a final map.

Objectives:

I. Install QGIS

II. Learn how to add a raster base map using the OpenLayers and QuickMapServices plugins.

III. Learn how to import vector data: latitude and longitude points plus additional variables. Learn how to select attributes from the data (e.g., salinity values) and plot them.

IV. Make a map using Print Composer in QGIS.

I. QGIS installation

QGIS is a powerful, user-friendly, open-source geographic information system that runs on Linux, Unix, macOS, and Windows. QGIS can be downloaded here. You should follow the instructions and install GDAL Complete.pkg, NumPy.pkg, matplotlib.pkg, and QGIS.pkg.

II. Install QGIS plugins and add a base map.

  1. Install QGIS plugins

Go to Plugins and select Manage and Install Plugins. This will open the plugins dialogue box; type OpenLayers Plugin in the search box and click on Install plugin.

This plugin will give you access to Google Maps, OpenStreetMap, and other layers, and it is very useful for making quick maps from Google satellite, physical, and street layers. However, the OpenLayers plugin can generate zoom errors in your maps. There is another plugin, QuickMapServices, which uses tile servers rather than the direct API to get Google and other layers. It is a very useful plugin that offers more base map options and fewer zoom errors. To install it, follow the same steps as for the OpenLayers plugin, except this time type QuickMapServices and install that plugin.

Also, if you want to experiment with QuickMapServices you can expand the plugin: go to Web -> QuickMapServices -> Settings -> More services and click on Get contributed pack. This will give you more base map options.

2. Add the base map layer:

I recommend playing with the various options in either OpenLayers (the Google satellite, physical, and other map layers) or QuickMapServices.

For this map, we will use the ESRI library from QuickMapServices. Go to Web -> QuickMapServices -> Esri -> ESRI Satellite.

You should see your satellite map.

You can click on the zoom-in icon to adjust the zoom, as shown in the map below where I zoomed in on the Galapagos Islands. You’ll also notice the Layers panel on the left side; this panel lists all the layers you add to your map. Layers can be raster or vector data, and in this case we see one layer: ESRI Satellite. At the far left you’ll see a column of icons that are used to import your layers. It is important to know what kind of data you are importing into QGIS so that you use the correct function.

III. Adding our vector data.

We will now add our data file, which contains the latitude and longitude of all the sites where we collected samples, along with values for salinity, temperature, and turbidity. You can do this with your own data by creating a file in Excel with a column of longitude values, a column of latitude values, and columns for your other variables, and saving it as a csv file. To input the data, go to the icons on the far left and click on “Add Delimited Text Layer”, or click Layer -> Add Layer -> Add Delimited Text Layer.
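
If you’d rather build that file programmatically than in a spreadsheet, here is a minimal sketch in R; the column names and values are made up for illustration only.

## A made-up example of the kind of table QGIS expects: one row per
## sampling point, with longitude, latitude, and your other variables.
points <- data.frame(lon = c(-78.90, -78.91, -78.93),
                     lat = c(1.23, 1.24, 1.26),
                     salinity_psu = c(28.4, 30.1, 31.7),
                     temperature_c = c(26.1, 26.3, 25.9))
write.csv(points, 'my_sampling_points.csv', row.names = FALSE)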

Browse to the file with your data. Make sure that CSV is selected as the file format. Additionally, make sure that the X field is set to the column with your longitude values and the Y field to the column with latitude. QGIS is usually smart enough to recognize the longitude and latitude columns, but double check! You can also see a preview of the data, with columns for latitude, longitude, barometric pressure (mmHg), conductivity, salinity (psu), and other variables. You can leave everything else at the defaults and click OK.

You’ll be prompted by the coordinate reference system selector. This step is very important, because if you do not select the right reference system your points will end up in the wrong location. For GPS coordinates, like the data we are using here, you need to select WGS 84, EPSG:4326.

Now we can see all the points where we collected data!

As we saw earlier, the data contain environmental measurements such as salinity, turbidity, and temperature. We can style the layer of sampling points based on any of these variables; in this example we will style the points by their salinity values.

Right click on the layer with our data in the Layers panel, in this case the layer 2017_ecuador_ysi_dat.., and select Properties.

There are many styles you can choose for the layer, and the styling options are located in the Style tab of the Properties dialogue. Clicking on the drop-down button in the Style dialogue, you’ll see there are five options available: Single Symbol, Categorized, Graduated, Rule Based, and Point displacement. We’ll use Graduated, which allows you to break the data down into classes. Here we will take the salinity values and classify them into 3 classes: low, medium, and high salinity. There are 5 modes available in the Graduated style to do this: Equal interval, Quantile, Natural breaks, Standard deviation, and Pretty breaks. You can read more about these options in the QGIS documentation.

In this tutorial, for simplicity, we’ll use the Quantile option. This method chooses the class breaks so that each class contains the same number of values; for example, if there are 100 values and we want 4 classes, the quantile method will set the breaks so that each class holds 25 values.
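
If you’re curious what the Quantile mode is doing behind the scenes, here is a quick illustration in R (nothing to do with QGIS itself); the salinity values below are made up.

## Quantile breaks for 3 classes: the 0, 1/3, 2/3, and 1 quantiles of the data.
salinity <- c(2, 5, 7, 11, 14, 18, 22, 25, 31)
quantile(salinity, probs = seq(0, 1, length.out = 4))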

In the Style section select Graduated, set Column to salinity psu, and for the color ramp choose colors ranging from yellow to red.

In the Classes box enter 3 and set Mode to Quantile. Click on Classify, and QGIS will break your values into the three ranges.

Now we have all the data points colored by the 3 ranges: low, medium, and high salinity.

However, we have a lot of points and they are hard to tell apart. We can edit the points by right clicking on the marker symbols and selecting Edit symbol.

Now, I am going to get rid of the black outline to make the points easier to see. Select the point by clicking on Simple marker and, under Outline style, select No Pen. Do the same for the remaining two classes.

Nice, now we can better see variations in our points based on salinity!

IV. Print Composer: making a final map

We can start to assemble the final version of our map. QGIS has a Print Composer where you can lay out and edit your map. Go to Project -> New Print Composer.

You will be prompted to enter a title for the composer; enter the title and hit OK. You will be taken to the Composer window.

In the Print Composer window, we want to bring in the map view that we see in the main QGIS canvas. Go to Layout -> Add Map. Once the Add map tool is active, hold the left mouse button and drag a rectangle where you want to insert the map. The rectangle will be rendered with the map from the main QGIS canvas.

At the far right you can see the Items box, which lists the map you just added. If you want to make changes, select the map and edit it under Item properties. Sometimes it is useful to adjust the scale until you are happy with the map.

We can also add a second map showing the location of Cayapas Mataje in South America as a geographic reference. Go to the main QGIS canvas and zoom out until you can see where in South America the reserve is located.

Now go back to the Print Composer and add the map of the entire region in the same way as the first map: go to Layout -> Add Map and drag a rectangle where you want to insert it. The rectangle will be rendered with the map from the main QGIS canvas. In the Items box you can see that you now have Map 0 and Map 1. Select Map 1 and add a frame under Item properties: click on Frame to activate it and adjust the thickness to 0.40 mm.

We can add a north arrow to the map. The Print Composer comes with a collection of map-related images, including many north arrows. Click Layout -> Add Image.

Holding the left mouse button, draw a rectangle in the top-right corner of the map canvas.

On the right-hand panel, click on the Item Properties tab, expand the Search directories section, and select the north arrow image you like the most. Once you’ve selected your image, you can always edit the arrow under SVG parameters.

Now we’ll add a scale bar. Click Layout -> Add Scalebar, then click on the layout where you want the scale bar to appear. Choose the style and units that fit your needs. In the Segments panel you can adjust the number of segments and their size. Make sure Map 0 is selected under Main properties.

Next I’ll add a legend to the map. Go to Layout -> Add Legend. Holding the left mouse button, draw a rectangle over the area where you want the legend to appear. Under Item properties you can make changes such as adding a title, changing fonts, and renaming your legend entries by clicking on them and typing the text you want.

It’s time to label our map. Click Layout -> Add Label. Click on the map and draw a box where the label should be. In the Item Properties tab, expand the Label section and enter the text as shown below. You can also change the font and size by editing the label under Appearance.

Once you have your final version, you can export it as an image, PDF, or SVG. For this tutorial, let’s export it as an image. Click Composer -> Export as Image.

Here is our final map!

Now you can try the tutorial with your own data. Making maps is always a bit challenging but put your imagination to work!

Here is a list of links that could help with QGIS:

-QGIS blog with various tutorials and new info on functions to use: here.

-If you want more information on how QGIS handles symbol and vector data styling, here is a good tutorial.

-If you need data, a good place to start is Natural Earth: Free vector and raster basemap data used for almost any cartographic endeavor.

If you have specific questions please don’t hesitate to ask.

 


Hunting for halophiles at the South Bay Saltworks

Hello! My name is Melissa Hopkins. I just finished my first quarter as an undergraduate researcher in the Bowman Lab. The project I am working on involves the diversity of halophiles from the South Bay Saltworks lakes in San Diego. The South Bay Saltworks is an active solar salt harvesting facility that is part of the San Diego Bay National Wildlife Refuge. The goal of this project is to use 16S/18S community structure to identify microbial taxa that are currently poorly represented in existing genomes. We want to contribute new halophile genomes to learn more about halophiles, and the community structure data help us decide which new genomes to sequence.

Aharon Oren’s open access review “Microbial life at high salt concentrations: phylogenetic and metabolic diversity” explains the different classes and orders that contain halophiles, as well as the similarities of the strategies used by these halophiles. Here, he uses the definition of halophiles as microbes that are able to tolerate 100 g/L salt but grow optimally at 50 g/L salt (seawater contains 35 g/L salt). Extreme halophiles (including the haloarchaea) are defined as growing best at salt concentrations of 2.5-5.2 M, and moderate halophiles as growing optimally at salt concentrations of 1.5-4.0 M. This assumes that the salt is sodium chloride; however, different salts such as magnesium chloride can present additional challenges to life.
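
To relate the two sets of units used above, here is a rough back-of-the-envelope conversion, assuming the salt is sodium chloride (molar mass about 58.44 g/mol).

## g/L to molar (and back) for NaCl
35 / 58.44    # seawater, roughly 0.6 M
100 / 58.44   # the 100 g/L tolerance threshold, roughly 1.7 M
2.5 * 58.44   # the low end of the extreme halophile range, roughly 146 g/L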

Halophiles are able to survive these high salt concentration environments in two different ways: pumping salt ions into their cells from the surrounding environment, or synthesizing organic solutes to match the concentration of their surrounding environment. Synthesizing organic solutes is more energetically expensive because it requires energy to make the high concentration of organic solutes needed, thus keeping salt ions out of the cytoplasm. But this strategy is actually found widely across halophile species. Pumping specific salt ions into their cells that don’t interfere with biological processes is less energetically expensive, but requires proteins in the cell to be specially adapted to high salt conditions. Because of this, that strategy is not seen as often across the different species of halophiles. Different families and orders of halophiles use variations of these strategies to survive. There have been some new halophile species, such as Salinibacter, that are outliers, in that they use a different survival strategy than other halophiles they are related to. Sequencing many more halophile genomes will give us new information on how these adaptive strategies are distributed across different halophiles.

For this project, Jeff, Natalia, and I spent the day sampling lakes from South Bay Saltworks on October 6. Our goal was to sample from 3 different points in each of several lakes of different salt concentrations, and to sample as many lakes as possible.

We started out at the lower salinity lakes to see how the equipment would function and get used to the sampling process. At each point, we took unfiltered samples using a peristaltic pump for respiration tests, bacterial abundance, measurements of photosynthetic efficiency and chlorophyll concentration, turbidity, and salinity.

We then placed a GF/F filter on the housing of the pump to collect a coarsely filtered sample for ion composition analysis, FDOM, and dissolved inorganic nutrients. Finally, we placed a 0.2 micron filter on the housing to collect bacteria, archaea, and phytoplankton for DNA analysis.

In all, we sampled 7 lakes: one low salt concentration lake, one medium salt concentration lake, 2 high salt concentration lakes, and 3 magnesium chloride lakes. Unfortunately, due to time constraints and the high viscosity of the magnesium chloride lakes (it’s like trying to filter maple syrup), we were only able to sample from 1 point in each of the magnesium chloride lakes.

Natalia and I setting up for sampling on one of the lower salinity lakes.  You can see the large piles of harvested salt in the background.

One of the magnesium chloride lakes we sampled. Due to the high salt concentration and extremely high attraction between water and magnesium chloride these lakes have an oily texture and high viscosity, making them difficult to sample.

One of the saltier sodium chloride lakes that we sampled from.  The pink color comes from pigments in the different microorganisms that we’re studying.


Ecuador update

Today we took the last sample of our Ecuador field effort, though we have a few days left in-country. Right now we are in the town of Mompiche, just down the coast from our second field site near Muisne. Tomorrow we’ll be sorting out gear and getting ready for a few days of meetings in Guayaquil. Then it’s time to fly home and start working up some data! I’m too tired to write a coherent post on the last two (intensive!) weeks of sampling, but here’s a photo summary of our work in the Cayapas-Mataje Ecological Reserve, where we spent the bulk of our time. Enjoy!

Pelicans congregating along a front in the Cayapas-Mataje estuary.

The town of Valdez, near the mouth of the Cayapas-Mataje estuary.  The Reserve is right on the border with Colombia, and up until a few years ago Valdez had a reputation as a trafficking hub.  Drug trafficking is still a problem throughout the Reserve, but with the conflict with FARC more or less over I understand that tension in the local communities has gone way down.  Valdez seems okay, and the people we met there were friendly.

Another view of Valdez.

Shrimp farm in the Cayapas-Mataje Ecological Reserve.  You can’t build a new farm in the Reserve, but old shrimp farms were grandfathered in when the Reserve was created.

The R/V Luz Yamira, at its home port of Tambillo.  Tambillo was a vibrant, friendly little town where we spent a bit of time.  The town is struggling to hold onto its subsistence fishing lifestyle in the face of declining fish stocks.

ADCP, easy as 1-2-3

Birds of a feather…

Morning commute from the city of San Lorenzo.

The Cayapas-Mataje Ecological Reserve has the tallest mangrove trees in the world.

I took this picture just for the good people at UCSD risk management.

Team Cayapas-Mataje.  From left: Jessie, Natalia, and Jeff.  We are standing in front of Jessie’s house in Tambillo.  Many thanks to Jessie and his wife for letting us stay a night and get away from the craziness of San Lorenzo!

A very full car ready to head to Muisne.  It’s a good thing Natalia and I are both fairly short.

A large shrimp farm near Muisne.  The area around Muisne has been almost entirely deforested for shrimp aquaculture.  By comparing this area with the more pristine Cayapas-Mataje, we hope to better understand the biogeochemical consequences of coastal land-use change.


Initial field effort in Ecuador

As any reader of this blog will know, most of the research in the Bowman Lab is focused on polar microbial ecology. Although focusing a research program on a set of geographically-linked environments does have advantages, primarily the ability to spend more time thinking in depth about them, there is, I think, something lost with this approach. Insights are often gained by bringing a fresh perspective to a new study area, or by applying lessons learned in such an area to places that one has studied for years. With this in mind, lab member Natalia Erazo and I are launching a new field effort in coastal Ecuador. Natalia is an Ecuadorean native, and gets credit for developing the idea and sorting out the considerable logistics for this initial effort. Our first trip is very much a scouting effort, but we will carry out some sampling in the Cayapas-Mataje Ecological Reserve and near the town of Muisne. Depending on funding we hope to return during the rainy season in January-February for a more intensive effort.

Dense coastal forest in the Cayapas-Mataje Ecological Reserve.  Photo Natalia Erazo.

Our primary objective is to understand the role of mangrove forests in coastal biogeochemical cycling. Mangroves are salt-tolerant trees that grow in tropical and sub-tropical coastal areas around the world. They are known to provide a range of positive ecosystem functions: serving as fish habitat, stabilizing shorelines, and providing carbon and nutrient subsidies to the coastal marine environment. Globally, mangroves are under threat. The population density of many tropical coastal areas is increasing, and that inevitably leads to land-use changes (such as deforestation) and a loss of ecosystem services – the social and economic benefits of an ecosystem – as economic activity ramps up. The trick to long-term sustainability is to maintain ecosystem services during economic development, allowing standards of living to increase in the short term without a long-term economic loss resulting from ecological failure (such as the collapse of a fishery or catastrophic coastal flooding). This is not easily done, and requires a much better understanding than we often have of exactly what functions specific at-risk components of an ecosystem provide.

One particular land-use threat to the mangrove ecosystem is shrimp aquaculture. Mangrove forests in Ecuador and in other parts of the world have been deforested on a massive scale to make room for shrimp aquaculture ponds. In addition to scaling back any ecosystem functions provided by the mangrove forest, the shrimp aquaculture facilities are a source of nutrients in the form of excrement and excess feed. On this trip we will try to locate estuaries more and less perturbed by aquaculture. By comparing nutrient and carbon concentrations, sediment load, and microbial community structure between these areas, we will gain a preliminary understanding of what happens to the coastal ecosystem when mangroves are removed and aquaculture facilities are installed in their place.

Our first stop on this search will be San Lorenzo, a small city in the Cayapas-Mataje Ecological Reserve near the border with Colombia. I’m extremely excited to visit the Reserve, which has the distinction of hosting the tallest mangrove trees anywhere on Earth. We may have some meager internet access in San Lorenzo, so I’ll try to update this blog as I’m able. Because of the remote nature of some of our proposed field sites we’ll have a Garmin InReach satellite messenger with us. We plan to leave the device on during our field outings, so you can track our location in real time on the map below! The Garmin map interface is a bit kludgey; you should ignore the other Scripps Institution of Oceanography users listed on the side panel as I can’t seem to make them disappear.


So you want to use your computer for science…

It’s been a while since I was a new graduate student, and I’ve forgotten how little I knew about computers back then.  I was reminded recently while teaching a couple of lab members how to use ffmpeg, an excellent command line tool for building animations from images (as described in this post).  We got there, but I realized that we needed a basic computing tutorial before moving on to anything more advanced.  If you’re trying to use your laptop for science, but you’re not too sure about this whole “command line thing”, this post’s for you.  Be warned that this tutorial is intended as only the most cursory crash course to get you moving up the initial learning curve.  A comprehensive list of commands for Bash on OSX can be found here.  At the end of this tutorial you will have a basic grasp of how to:

  • Navigate the file system
  • Create and edit text files
  • Search text files
  • Create a shell script
  • Modify the PATH variable using startup files (.bash_profile)

Okay, so I want to get a computer to do some science.  What should I get?

Whatever you want.  It really doesn’t matter.  Most grad students seem to get Macs these days, which I don’t love (they’re costly and epitomize form over function), but have the slight advantage of a Unix-like environment hiding behind all that OSX gibberish.  I use a Windows machine (which I also don’t love, as it epitomizes dysfunction over function) with Cygwin, which gives me access to all the Linux tools that I need to carry out day-to-day operations.  Windows 10 users can also make use of the Bash shell add-on for Windows, but I haven’t found any advantage to this over Cygwin.  The point is that you need either a) A Mac, b) A Windows machine running Cygwin or the add-on, or c) A Linux machine.  The command prompt and output given below are what you will see in OSX (faked since I’m not actually using a Mac), but are similar to what you will get with Cygwin or one of the Linux distros.  The commands should work the same across all of these options.

Getting familiar with “Bash”

Bash stands for Bourne-again shell, which you can read all about in many other places on the web.  Bash is a very powerful tool for manipulating your computer’s file system, executing programs, and even creating programs.  It is a cornerstone of scientific computing and you should have at least some passing familiarity with it.  To open the Bash terminal in OSX, go to Applications/Utilities/Terminal in Finder.  A mysterious black (or white) window will open, with a white (black) cursor waiting for YOU.  Type “pwd” for print working directory and hit “Enter”.

jeffscomputer:~ jeff$ pwd
/Users/jeff

Bash will respond with your location, which should be something along the lines of /Users/username.  Next type “ls” to list the contents of the directory.

jeffscomputer:~ jeff$ ls
bin
Desktop
Documents
Downloads

We can use the command “cd” to change directories.  For example, if you want to move to your Desktop directory type “cd Desktop”.

jeffscomputer:~ jeff$ cd Desktop
jeffscomputer:Desktop jeff$ pwd
/Users/jeff/Desktop

Because the directory “Desktop” was just one level down, in this case the relative path “Desktop” is equivalent to typing the full path “/Users/jeff/Desktop”.  Here’s a useful tip: from any location on your system, “cd ~” will get you back to your home directory.

jeffscomputer:Desktop jeff$ cd ~
jeffscomputer:~ jeff$ pwd
/Users/jeff

Now let’s create a directory to do some work.  The command for this is “mkdir temp”, for “make directory with name temp.”

jeffscomputer:~ jeff$ mkdir temp
jeffscomputer:~ jeff$ cd temp

Now move into that directory.  You already know how 🙂

Creating and editing text files

You will frequently need to create and edit basic text files without all the fancy formatting of a word processing document.  The most user-friendly way to do this is with the program nano, which is likely already present if you are using OSX or Cygwin.  Type “nano temp.txt” and nano will open a blank text file with the name temp.txt.

jeffscomputer:temp jeff$ nano temp.txt

Type a couple lines of text and, when you’re ready to exit and save the file, hit ctrl-x.  Nano will prompt you about saving the output; hit “y” and then Enter to accept the file name.  List the contents of the directory and notice that the file temp.txt now exists.  Type “nano temp.txt” again and, rather than create a new file, nano will open temp.txt for editing.

Having gone through the trouble of creating that file, let’s go ahead and delete it using the remove or “rm” command.

jeffscomputer:temp jeff$ rm temp.txt

To do some fancier things with files let’s download one that has a little more information in it.  There are two common programs for fetching files online: wget and curl.  I find wget much easier to use than curl.  If you’re using Cygwin or Linux you likely already have it, but for OSX you first need to install a package manager, which is a whole can of worms that I’m not going to tackle in this post.  So let’s use curl to download a text file, in this case “The Rime of the Ancient Mariner” by Samuel Coleridge.  The basic download command for curl is:

jeffscomputer:temp jeff$ curl -O https://www.polarmicrobes.org/extras/ancient_mariner.txt

This should create the file “ancient_mariner.txt” in your working directory (e.g., “/Users/jeff/temp”).

Viewing file content

The reason we downloaded this more complex text file (it’s a pretty long poem) is to simulate a longer data file than our “temp.txt” file.  Very often in scientific computing you have text files with hundreds, thousands, or even millions of lines.  Just opening such a file can be onerous, let alone finding some specific piece of information.  Fortunately there are tools that can help.  Type “head ancient_mariner.txt”; this returns the first 10 lines of the file.

jeffscomputer:temp jeff$ head ancient_mariner.txt
PART THE FIRST
It is an ancient mariner,
And he stoppeth one of three.
"By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
"The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din."
He holds him with his skinny hand,

Want to guess what “tail” does?  For either command you can use a “flag” to modify behavior, such as returning more lines.  Flags are always preceded by “-” or “--”, and generally come before the positional arguments of the command, in this case the file.  This is general syntax for Unix commands, and does not apply only to “head” and “tail”.

jeffscomputer:temp jeff$ tail -15 ancient_mariner.txt
Both man and bird and beast.
He prayeth best, who loveth best
All things both great and small;
For the dear God who loveth us,
He made and loveth all.
The Mariner, whose eye is bright,
Whose beard with age is hoar,
Is gone: and now the Wedding-Guest
Turned from the bridegroom's door.
He went like one that hath been stunned,
And is of sense forlorn:
A sadder and a wiser man,
He rose the morrow morn.

In the examples so far the command output has printed to the screen, but what if we want to capture it in a file?  For example, what if we have a huge data file (millions of rows) and we want just the top few lines to test some code or share with a collaborator?  This is easily done by redirecting the output using “>”.

jeffscomputer:temp jeff$ tail -15 ancient_mariner.txt > end_of_ancient_mariner.txt

Searching file content

To find specific information in a file use the command “grep”.  Without launching into a full-on explanation of regular expressions, grep (very quickly) finds lines that match some given pattern.  You can either count or view the lines that match.  To find the lines with the word “Albatross” in ancient_mariner.txt:

jeffscomputer:temp jeff$ grep 'Albatross' ancient_mariner.txt
At length did cross and Albatross,
I shot the Albatross."
Instead of the cross, the Albatross
The Albatross fell off, and sank
The harmless Albatross."
The Albatross's blood.

Seems there’s a typo in this version of the poem (and|an)!  At any rate, use the -c flag to count the lines.  Protip: use the “up” key to bring back the previous line, which you can then modify.

jeffscomputer:temp jeff$ grep -c 'Albatross' ancient_mariner.txt
6

Use the -v flag to select against lines with “Albatross”, which you can combine with the -c or other flags:

jeffscomputer:temp jeff$ grep -cv 'Albatross' ancient_mariner.txt
634

Build a basic program

So far we’ve been executing commands manually, from the command line.  Suppose we have a set of commands that we want to execute often, or a workflow that we need to document.  To do this we create a shell script.  Fire up nano for a file named “temp.sh” and type:

#!/bin/bash

echo "hello world!"
# hey, this line doesn't do anything!
f=ancient_mariner.txt
echo $f
grep -cv 'Albatross' $f

Line by line here’s what’s going on:

  • The first line is called the shebang and it tells your system what interpreter to use to run the script (in this case Bash).  /bin/bash is an actual location on your computer where the Bash program resides
  • The next line is just a bit of formality; by strictly adhered-to convention your first program should always be a little script that says “hello world!”.  It does, however, illustrate the “echo” command, which prints out information.
  • The next line starts with “#”, which denotes a comment.  Lines that start with “#” are not read by Bash.  You can use that character to make notes, or to toggle commands on and off.
  • In the next line we assign a variable (f) a value (ancient_mariner.txt).  We can now call that variable using “$”.  The next two lines are examples of this.

To execute the script we simply type the file name into Bash.  Before we do that, however, we need to set the file permissions, as files are not executable by default.  To do that we use the “chmod” command with the “a+x” options (note that this is not a flag).

jeffscomputer:temp jeff$ chmod a+x temp.sh

You can run it now, but there is one final trick.  Bash doesn’t know to look in your working directory for the script; you have to specify that that’s where it is.  The location of the current working directory is always “./”, so the command looks like this:

jeffscomputer:temp jeff$ ./temp.sh
hello world!
ancient_mariner.txt
634

Setting up your environment

Okay, we’re going to ramp things up a bit for the grand finale and modify the Bash startup files to better use your new-found skills.  There are several possible startup files, and the whole startup file situation gets a bit confusing.  We will modify the .bash_profile file, which will handle the majority of use cases, but you should take the time to familiarize yourself with the different files.  The .bash_profile file is a hidden file (denoted by the leading .); you can see hidden files by using “ls” with the “-a” flag.

jeffscomputer:temp jeff$ cd ~
jeffscomputer:~ jeff$ ls -a
bin
Desktop
Documents
Downloads
.bash_history
.bashrc
.profile
.ssh

Don’t worry if you don’t see .bash_profile; we will create it shortly.  First, to understand why we need to modify the startup file in the first place, try executing, from your home directory, the shell script that we just created.

jeffscomputer:~ jeff$ temp.sh
-bash: temp.sh: command not found

The command would execute if you typed temp/temp.sh, but remembering the location of every script and program so that you can specify the complete path to it would be silly.  Instead, Bash stores this information in a variable named PATH.  To see what’s in PATH use “echo”:

jeffscomputer:~ jeff$ echo $PATH
/usr/local/bin:/usr/bin:/usr/sbin

If you created a script in /usr/local/bin, Bash would know to look there and would find the script.  It’s a good idea, however, to keep user-generated scripts in the home directory to avoid cluttering up locations used by the operating system.  What we need to do is automatically update PATH with customized locations whenever we start a new Bash session.  We accomplish this by modifying PATH on startup using the startup file.  To do this use nano, running it as a super user with “sudo”.

jeffscomputer:~ jeff$ sudo nano .bash_profile

As you already know, if you don’t have .bash_profile this will create it for you.  Now, at the end of the .bash_profile file (or on the first line, if it’s empty), type (replacing “jeff” with your username):

export PATH=/Users/jeff/temp:$PATH

The command structure might look opaque at first but it’s really not.  This line is saying “export the variable PATH as this text, followed by the original PATH variable”.

Close nano.  Recall that .bash_profile is read by Bash at the start of the Bash session.  Your newly defined PATH variable will be read if you start a new session, but you can also force Bash to read it with the “source” command.

jeffscomputer:~ jeff$ source .bash_profile

Now try to execute your bash script.

jeffscomputer:~ jeff$ temp.sh
hello world!
ancient_mariner.txt
634

Last, let’s clean things up a little.  Previously we used “rm” to remove a file.  The same command with the -r flag will remove a directory.

jeffscomputer:~ jeff$ rm -r temp

Voila!  To keep your .bash_profile file looking pretty, don’t forget to remove the line adding temp to PATH (though it does no harm).  Or you can comment that line out and leave it as an example of the correct syntax for when you next add a new location to PATH.


Microbial session at POLAR 2018 in Davos

Davos, Switzerland, site of the POLAR2018 conference.  Image from https://www.inghams.co.uk

With colleagues Maria Corsaro, Eric Collins, Maria Tutino, Jody Deming, and Julie Dinasquet I’m convening a session on polar microbial ecology and evolution at the upcoming POLAR2018 conference in Davos, Switzerland.  POLAR2018 is shaping up to be a unique and excellent conference; for the first time (I think) the major international Arctic science organization (IASC) is joining forces with the major international Antarctic science organization (SCAR) for a joint meeting.  Because many polar scientists specialize in one region, and thus have few opportunities to learn from the other, a joint meeting will be a great opportunity to share ideas.

Here’s the full-text description for our session.  If your work is related to microbial evolution, adaptation, or ecological function in the polar regions I hope you’ll consider applying!


Denitrifying bacteria and oysters

The eastern oyster.  Denitrification never tasted so good!  Photo from http://dnr.sc.gov.

I’m happy to be co-author on a study that was just published by Ann Arfken, a PhD student at the Virginia Institute of Marine Science (VIMS).  The study evaluated the composition of the microbial community associated with the eastern oyster Crassostrea virginica to determine if oysters play a role in denitrification.  Denitrification is an ecologically significant anaerobic microbial metabolism.  In the absence of oxygen certain microbes use other oxidized compounds as electron acceptors.  Nitrate (NO3) is a good alternate electron acceptor, and the process of reducing nitrate to nitrite (NO2), and ultimately to elemental nitrogen (N2), is called denitrification.  Unfortunately nitrate is needed by photosynthetic organisms like plants and phytoplankton, so the removal of nitrate can be detrimental to ecosystem health.  Oxygen is easily depleted in the guts of higher organisms by high rates of metabolic activity, creating a niche for denitrification and other anaerobic processes.

Predicted relative abundance (paprica) as a function of measured (qPCR) relative abundance of nosZI genes.  From Arfken et al. 2017.

To evaluate denitrification in C. virginica, Arfken et al. coupled actual measurements of denitrification in sediments and oysters with an analysis of microbial community structure in oyster shells and digestive glands.  We then used paprica with a customized database to predict the presence of denitrification genes, and validated the predictions with qPCR.

I was particularly happy to see that the qPCR results agreed well with the paprica predictions for the nosZ gene, which codes for the enzyme responsible for reducing nitrous oxide (N2O) to N2.  I believe this is the first example of qPCR being used to validate metabolic inference – which currently lacks a good method for validation.  Surprisingly, however (at least to me), denitrification in C. virginica was largely associated with the oyster shell rather than the digestive gland.  We don’t really know why this is.  Arfken et al. suggest rapid colonization of the oyster shell by denitrifying bacteria that also produce antibiotic compounds to exclude predators, but further work is needed to demonstrate this!


Analyzing flow cytometry data with R

We recently got our CyFlow Space flow cytometer in the lab and have been working out the kinks.  From a flow cytometry perspective the California coastal environment is pretty different from the western Antarctic Peninsula, where I’ve done most of my flow cytometry work.  Getting my eyes calibrated to a new flow cytometer and the coastal California environment has been an experience.  Helping me on this task is Tia Rabsatt, a SURF REU student from the US Virgin Islands.  Tia will be heading home in a couple of weeks, which presents a challenge: once she leaves she won’t have access to the proprietary software that came with the flow cytometer.  To continue analyzing the data she collected over the summer as part of her project she’ll need a different solution.

To give her a way to work with the FCS files I put together a quick R script that reads in the file, sets some event limits, and produces a nice plot.  With a little modification one could “gate” and count different regions; a minimal sketch of that follows the script.  The script uses the flowCore package to read in the FCS format files, and the hist2d command in gplots to make a reasonably informative plot.

library('flowCore')
library('gplots')

#### parameters ####

f.name <- 'file.name.goes.here'  # name of the file you want to analyze, file must have extension ".FCS"
sample.size <- 1e5               # number of events to plot, use "max" for all points
fsc.ll <- 1                      # FSC lower limit
ssc.ll <- 1                      # SSC lower limit
fl1.ll <- 1                      # FL1 lower limit (ex488/em536)

#### functions ####

## plotting function

plot.events <- function(fcm, x.param, y.param){
  hist2d(log10(fcm[,x.param]),
         log10(fcm[,y.param]),
         col = c('grey', colorRampPalette(c('white', 'lightgoldenrod1', 'darkgreen'))(100)),
         nbins = 200,
         bg = 'grey',
         ylab = paste0('log10(', y.param, ')'),
         xlab = paste0('log10(', x.param, ')'))
  
  box()
}

#### read in file ####

fcm <- read.FCS(paste0(f.name, '.FCS'))
fcm <- as.data.frame((exprs(fcm)))

#### analyze file and make plot ####

## eliminate values that are below or equal to thresholds you
## defined above

fcm$SSC[fcm$SSC <= ssc.ll|fcm$FSC <= fsc.ll|fcm$FL1 <= fl1.ll] <- NA
fcm <- na.omit(fcm)

fcm.sample <- fcm

if(sample.size != 'max'){
  try({fcm.sample <- fcm[sample(length(fcm$SSC), sample.size),]},
      silent = T)
}

## plot events in a couple of different ways

plot.events(fcm.sample, 'FSC', 'SSC')
plot.events(fcm.sample, 'FSC', 'FL1')

## make a presentation quality figure

png(paste0(f.name, '_FSC', '_FL1', '.png'),
    width = 2000,
    height = 2000,
    pointsize = 50)

plot.events(fcm.sample, 'FSC', 'FL1')

dev.off()
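
As mentioned above, with a little modification you can gate and count events.  Here’s a minimal sketch that counts events falling inside a rectangular gate; the gate limits are hypothetical and you would pick them by eye from your own plots.

## Hypothetical rectangular gate on log10(FSC) and log10(FL1).
fsc.gate <- c(2, 4)    # log10(FSC) limits
fl1.gate <- c(1, 3)    # log10(FL1) limits

gate.i <- log10(fcm$FSC) >= fsc.gate[1] & log10(fcm$FSC) <= fsc.gate[2] &
  log10(fcm$FL1) >= fl1.gate[1] & log10(fcm$FL1) <= fl1.gate[2]

sum(gate.i)  # number of events in the gate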

And here’s the final plot:


OPAG comes to Scripps Institution of Oceanography

Artist’s rendition of the Dawn spacecraft approaching Ceres.  Original image can be found here.

I’m excited to be hosting the fall meeting of NASA’s Outer Planets Assessment Group (OPAG) here at Scripps Institution of Oceanography in September.  For planetary scientists at UCSD, SDSU, USD, and other institutions in the greater San Diego area, if you’ve never attended the OPAG meeting here’s your chance!  The meeting will be September 6 and 7 at the Samuel H. Scripps Auditorium.  More details can be found here.

What are assessment groups, and what is OPAG specifically?  The assessment groups are an excellent NASA innovation to encourage dialogue within the scientific community, and between the scientific community and NASA HQ.  There’s usually a little tense dialogue – in a good way – between these two ends of the scientific spectrum.  I often wish NSF had a similar open-format dialogue with its user community!  The format of the OPAG meeting is 15- or 30-minute presentations on a range of topics relevant to the community.  These are often mission updates, planning updates for future missions, and preliminary results from the analysis of mission data.  NASA has quite a few assessment and analysis groups, ranging from the Small Bodies Assessment Group (SBAG – probably the AG with the catchiest acronym) to the Mars Exploration Program Analysis Group (MEPAG).  OPAG covers everything in the solar system further from the sun than Mars.  If that covers your favorite planetary body, come and check it out!

It’s traditional to have a public evening lecture with the OPAG meeting.  For the upcoming meeting the lecture will be given at 7 pm on September 6 at the Samuel Scripps Auditorium by my friend and colleague Britney Schmidt, a planetary scientist from Georgia Tech and an expert on Europa and on Antarctic ice sheets.  Why and how one can develop that dual expertise will certainly be made clear in her talk.  There is no cost to attend, but an RSVP is recommended.  You can find more details and RSVP here.


Analyzing “broad” data with glmnet

Very often environmental datasets contain far fewer observations than we would like, and far more variables that might influence these observations.  This is the case for one of my projects in the Palmer LTER: I have annual observations of an environmental indicator (n = 19) and far more than 19 environmental conditions (predictors, p) that might influence the indicator.  This is a broad dataset (p >> n), and the problem can’t be solved with classic multiple regression (for which the number of observations must exceed the number of predictors).  Enter the R package glmnet.  Glmnet is an implementation of lasso, ridge, and elastic-net regression.  There are a limited number of glmnet tutorials out there, including this one, but I couldn’t find one that really provided a practical start-to-end guide.  Here’s the method that I came up with for using glmnet to select variables for multiple regression.  I’m not an expert in glmnet, so don’t take this as a definitive guide.  Comments are welcome!

First, let’s get some data.  The data are various climate indices with a 13-month time lag, as well as sea ice extent, open water extent, and fractional open water extent.  There are a couple of years missing from this dataset, and we need to remove one more year for reasons that aren’t important here.

env.params <- read.csv('https://www.polarmicrobes.org/extras/env_params.csv', header = T, row.names = 1)
env.params <- env.params[row.names(env.params) != '1994',]

To keep things simple we’ll create our response variable by hand. Never mind what the response variable is for now; you can read our paper when it comes out 🙂

response <- c(0.012, 0.076, 0.074, 0.108, 0.113, 0.154, 0.136, 0.183, 0.210, 0.043, 0.082, 0.092, 0.310, 0.185, 0.436, 0.357, 0.472, 0.631, 0.502)

One thing to note at this point is that, because we have so few observations, we can’t really withhold any to cross-validate the glmnet regression. That’s a shame because it means that we can’t easily optimize the alpha parameter (which determines whether glmnet uses lasso, elastic net, or ridge) as was done here. Instead we’ll use alpha = 0.9. This is to make the analysis a little bit elastic-netty, but still interpretable.
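
For reference, here is a rough sketch of how alpha could be chosen by cross-validation if there were enough observations to do so; the predictor matrix x and response y below are placeholders, not the data used in this post.

library(glmnet)

## Use the same folds for every alpha so the comparison is fair.
set.seed(123)
foldid <- sample(rep(1:10, length.out = length(y)))

alphas <- seq(0, 1, by = 0.1)

cv.error <- sapply(alphas, function(a){
  cv <- cv.glmnet(x, y, family = 'gaussian', alpha = a, foldid = foldid,
                  type.measure = 'deviance')
  min(cv$cvm)
})

alphas[which.min(cv.error)]  # alpha with the lowest cross-validated error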

In my real analysis I was using glmnet to identify predictors for lots of response variables.  It was therefore useful to write a function to generalize the several commands needed for glmnet.  Here’s the function:

library(glmnet)

## The function requires a matrix of possible predictors, a vector with the response variable,
## the GLM family used for the model (e.g. 'gaussian'), the alpha parameter, and "type.measure".
## See the documentation on cv.glmnet for options.  Probably you want "deviance". 

get.glmnet <- function(predictors, response, family, alpha, type.measure){
  glmnet.out <- glmnet(predictors, response, family = family, alpha = alpha)
  glmnet.cv <- cv.glmnet(predictors, response, family = family, alpha = alpha, type.measure = type.measure)
  
  ## Need to find the local minima for lambda.
  lambda.lim <- glmnet.cv$lambda[which.min(glmnet.cv$cvm)]
  
  ## Now identify the coefficients that correspond to the lambda minimum.
  temp.coefficients.i <- coefficients(glmnet.out, s = lambda.lim)@i + 1 # +1 because the @i slot is zero-indexed
  
  ## And the parameter names...
  temp.coefficients.names <- coefficients(glmnet.out, s = lambda.lim)@Dimnames[[1]][temp.coefficients.i]
  temp.coefficients <- coefficients(glmnet.out, s = lambda.lim)@x
  
  ## Package for export.
  temp.coefficients <- rbind(temp.coefficients.names, temp.coefficients)
  return(temp.coefficients)
}

Phew!  Okay, let’s try to do something with this.  Note that glmnet requires a matrix as input, not a dataframe.

response.predictors <- get.glmnet(data.matrix(env.params), response, 'gaussian', 0.9, 'deviance')
> t(response.predictors)
      temp.coefficients.names temp.coefficients      
 [1,] "(Intercept)"           "0.496410195525042"    
 [2,] "ao.Apr"                "0.009282064516813"    
 [3,] "ao.current"            "-0.0214919836174853"  
 [4,] "pdo.Mar"               "-0.0568728879266135"  
 [5,] "pdo.Aug"               "-0.00881845994191182" 
 [6,] "pdo.Dec"               "-0.0321738320415234"  
 [7,] "ow.Dec"                "1.92231198892721e-06" 
 [8,] "ow.frac.current"       "0.207945851122607"    
 [9,] "ice.Sep"               "-2.29621552264475e-06"

So glmnet retains eight predictors (plus the intercept).  Possible, but I’m a little suspicious.  I’d like to see this in the more familiar GLM format so that I can wrap my head around the significance of the variables.  Let’s start by building a null model with all of these predictors.

null.model <- lm(response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + pdo.Dec + ow.Dec + ow.frac.current + ice.Sep, data = env.params)
> summary(null.model)

Call:
lm(formula = response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + 
    pdo.Dec + ow.Dec + ow.frac.current + ice.Sep, data = env.params)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.14304 -0.03352 -0.01679  0.05553  0.13497 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)      2.052e+00  1.813e+00   1.132   0.2841  
ao.Apr           2.226e-02  3.702e-02   0.601   0.5610  
ao.current      -3.643e-02  1.782e-02  -2.044   0.0681 .
pdo.Mar         -7.200e-02  3.215e-02  -2.240   0.0490 *
pdo.Aug         -1.244e-02  2.948e-02  -0.422   0.6821  
pdo.Dec         -6.072e-02  3.664e-02  -1.657   0.1285  
ow.Dec           3.553e-06  1.749e-06   2.032   0.0696 .
ow.frac.current  1.187e-01  4.807e-01   0.247   0.8100  
ice.Sep         -1.023e-05  8.558e-06  -1.195   0.2596  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09043 on 10 degrees of freedom
Multiple R-squared:  0.8579,	Adjusted R-squared:  0.7443 
F-statistic: 7.549 on 8 and 10 DF,  p-value: 0.002224

Hmmm… this looks messy.  There’s only one predictor with a slope significantly different from 0.  A high amount of variance in the original data is accounted for, but it smells like overfitting to me.  Let’s QC this model a bit by 1) checking for collinearity among the predictors, 2) using AIC and relative likelihood to further eliminate predictors, and 3) selecting the final model with ANOVA.

## Check for collinearity using variance inflation factors

library(car)
> vif(null.model)
         ao.Apr      ao.current         pdo.Mar         pdo.Aug         pdo.Dec          ow.Dec ow.frac.current 
       1.610203        1.464678        1.958998        2.854688        2.791749        1.772611        1.744653 
        ice.Sep 
       1.510852

Surprisingly, all the parameters have acceptable VIF scores (VIF < 5).  Let’s proceed with AIC and relative likelihood.

## Define a function to evaluate relative likelihood.

rl <- function(aicmin, aici){
  return(exp((aicmin-aici) / 2))
}

## Construct multiple possible models, adding parameters in order of abs(t value) in summary(null.model).

model.1 <- lm(response ~ pdo.Mar, data = env.params)
model.2 <- lm(response ~ pdo.Mar + ao.current, data = env.params)
model.3 <- lm(response ~ pdo.Mar + ao.current + ow.Dec, data = env.params)
model.4 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec, data = env.params)
model.5 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep, data = env.params)
model.6 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr, data = env.params)
model.7 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr + pdo.Aug, data = env.params)

## Collect AIC scores for models.

model.null.aic <- AIC(null.model)
model.1.aic <- AIC(model.1)
model.2.aic <- AIC(model.2)
model.3.aic <- AIC(model.3)
model.4.aic <- AIC(model.4)
model.5.aic <- AIC(model.5)
model.6.aic <- AIC(model.6)
model.7.aic <- AIC(model.7)

## Identify the model with the lowest AIC score.

> which.min(c(model.1.aic, model.2.aic, model.3.aic, model.4.aic, model.5.aic, model.6.aic, model.7.aic, model.null.aic))
[1] 5

So model.5 has the lowest AIC.  We need to check the relative likelihood of other models minimizing information loss. Models with values < 0.05 do not have a significant likelihood of minimizing information loss and can be discarded.

> rl(model.5.aic, model.1.aic)
[1] 8.501094e-05
> rl(model.5.aic, model.2.aic)
[1] 0.00143064
> rl(model.5.aic, model.3.aic)
[1] 0.002415304
> rl(model.5.aic, model.4.aic)
[1] 0.7536875
> rl(model.5.aic, model.6.aic)
[1] 0.4747277
> rl(model.5.aic, model.7.aic)
[1] 0.209965
> rl(model.5.aic, model.null.aic)
[1] 0.08183346

Excellent, we can discard quite a few possible models here: model.1, model.2, and model.3.  Last we’ll use ANOVA and the chi-squared test to see if any of the models are significantly different from model.4, the model with the fewest parameters.

> anova(model.4, model.5, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626                      
2     13 0.086169  1  0.012457   0.1704

> anova(model.4, model.6, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + 
    ao.Apr
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626                      
2     12 0.083887  2   0.01474   0.3485

> anova(model.4, model.7, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + 
    ao.Apr + pdo.Aug
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626                      
2     11 0.082276  3   0.01635   0.5347

> anova(model.4, null.model, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + pdo.Dec + 
    ow.Dec + ow.frac.current + ice.Sep
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626                      
2     10 0.081778  4  0.016849   0.7247

Okay, there’s a whole lot going on there; the important thing is that none of the models are significantly different from the model with the fewest parameters.  There are probably some small gains in the performance of those models, but at an increased risk of overfitting.  So model.4 is the winner, and we deem pdo.Mar, ao.current, ow.Dec, and pdo.Dec to be the best predictors of the response variable.  For those of you who are curious, that’s the March index for the Pacific Decadal Oscillation, the index for the Antarctic Oscillation when the cruise was taking place, within-pack ice open water extent in December (the month prior to the cruise), and the Pacific Decadal Oscillation index in December.  Let’s take a look at the model:

> summary(model.4)

Call:
lm(formula = response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec, 
    data = env.params)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.177243 -0.037756 -0.003996  0.059606  0.114155 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.586e-02  5.893e-02  -0.609 0.552563    
pdo.Mar     -9.526e-02  2.208e-02  -4.315 0.000713 ***
ao.current  -3.127e-02  1.590e-02  -1.967 0.069308 .  
ow.Dec       4.516e-06  1.452e-06   3.111 0.007663 ** 
pdo.Dec     -8.264e-02  2.172e-02  -3.804 0.001936 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08393 on 14 degrees of freedom
Multiple R-squared:  0.8287,	Adjusted R-squared:  0.7797 
F-statistic: 16.93 on 4 and 14 DF,  p-value: 2.947e-05

You’ll recall that the null model accounted for 74 % of the variance in the response variable; our final model accounts for 78 %, and we’ve dropped several parameters.  Not bad!  Note, however, that we never could have gotten to this point without the holistic search provided by glmnet.

I can still hear some grumbling about overfitting, however, so let’s try to address that by bootstrapping.  We are crazy data-limited with only 19 observations, but let’s iteratively hold three observations back at random, build a model with the remaining 16 (using our identified parameters), and see how well each model predicts the three that were held back.  I created a function to do this so that I could repeat this exercise for lots of different response variables.

## The function requires as input the response vector, a vector of the predictors,
## the data frame of predictors, and the number of iterations you want to run.

bootstrap.model <- function(response.vector, predictors, data, n){
  
  predictor.df <- as.data.frame(data[,predictors])
  colnames(predictor.df) <- predictors
  model.predictions <- matrix(ncol = 3, nrow = 0)
  colnames(model.predictions) <- c('year', 'prediction', 'real')
  
  ## How many observations you need to withhold is determined by how many
  ## iterations you want to run.  For example, for 1000 iterations of
  ## unique random subsets of 19 observations, you need to withhold 3.
  
  x <- 0
  boot <- 0
  
  for(y in (length(response.vector) - 1):1){
    while(x < n){
      boot <- boot + 1
      x <- y ** boot
      print(x)
    }
  }
  
  ## Now build n new models.
  
  for(i in 1:n){
    print(i)
    predict.i <- sample.int(length(response.vector), boot)
    train.i <- which(!1:length(response.vector) %in% predict.i)
    
    temp.model <- lm(response.vector ~ ., data = predictor.df, subset = train.i)
    
    new.data <- as.data.frame(predictor.df[predict.i,])
    colnames(new.data) <- predictors
    
    temp.predict <- predict(temp.model, newdata = new.data, type = 'response')
    temp.actual <- response.vector[predict.i]
    
    temp.out <- matrix(ncol = 3, nrow = boot)
    try({temp.out[,1] <- names(response.vector)[predict.i]}, silent = T)
    temp.out[,2] <- temp.predict
    temp.out[,3] <- temp.actual
    
    model.predictions <- rbind(model.predictions, temp.out)
    
  }
  
  mean.predicted <- as.data.frame(tapply(as.numeric(model.predictions[,2]), as.character(model.predictions[,3]), mean))
  sd.predicted <- as.data.frame(tapply(as.numeric(model.predictions[,2]), as.character(model.predictions[,3]), sd))
  
  ## Make an awesome plot.

  plot(mean.predicted[,1] ~ as.numeric(row.names(mean.predicted)),
     ylab = 'Predicted response',
     xlab = 'Observed response',
     pch = 19)
  
  for(r in row.names(mean.predicted)){
    lines(c(as.numeric(r), as.numeric(r)), c(mean.predicted[r,1], mean.predicted[r,1] + sd.predicted[r,1]))
    lines(c(as.numeric(r), as.numeric(r)), c(mean.predicted[r,1], mean.predicted[r,1] - sd.predicted[r,1]))
  }
  
  abline(0, 1, lty = 2)
  
  mean.lm <- lm(mean.predicted[,1] ~ as.numeric(row.names(mean.predicted)))
  abline(mean.lm)
  print(summary(mean.lm))
  
  ## Capture stuff that might be useful later in a list.
  
  list.out <- list()
  list.out$model.predictions <- model.predictions
  list.out$mean.lm <- mean.lm
  list.out$mean.predicted <- mean.predicted
  list.out$sd.predicted <- sd.predicted
  
  return(list.out)
  
}

## And execute the function...

demo.boostrap <- bootstrap.model(response, c('pdo.Mar', 'ao.current', 'ow.Dec', 'pdo.Dec'), env.params, 1000)

Hey, that looks pretty good!  The dotted line is 1:1, while the solid line gives the slope of the regression between predicted and observed values.  The lines through each point give the standard deviation of the predictions for that point across all iterations.  There’s scatter here, but the model predicts low values for low observations, and high for high, so it’s a start…
