Very often environmental datasets contain far fewer observations than we would like, and far more variables that might influence those observations. This is the case for one of my projects in the Palmer LTER: I have annual observations of an environmental indicator (n = 19) and far more than 19 environmental conditions (predictors, p) that might influence the indicator. This is a broad dataset (p >> n), and the problem can’t be solved with classic multiple regression (for which the number of observations must exceed the number of predictors). Enter the R package glmnet. Glmnet is an implementation of lasso, ridge, and elastic-net regression. There are a limited number of glmnet tutorials out there, including this one, but I couldn’t find one that really provided a practical start-to-end guide. Here’s the method I came up with for using glmnet to select variables for multiple regression. I’m not an expert in glmnet, so don’t take this as a definitive guide. Comments are welcome!
First, let’s get some data. The data are various climate indices with a 13-month time lag, as well as sea ice extent, open water extent, and fractional open water extent. There are a couple of years missing from this dataset, and we need to remove one more year for reasons that aren’t important here.
env.params <- read.csv('https://www.polarmicrobes.org/extras/env_params.csv', header = T, row.names = 1)
env.params <- env.params[row.names(env.params) != '1994',]
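It’s worth confirming that the predictor table now has the expected number of rows; it should match the 19 response observations defined below.

## The number of rows should equal the length of the response vector (19).
## The number of columns depends on the file.
dim(env.params)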
To keep things simple we’ll create our response variable by hand. Never mind what the response variable is for now, you can read our paper when it comes out 🙂
response <- c(0.012, 0.076, 0.074, 0.108, 0.113, 0.154, 0.136, 0.183, 0.210, 0.043, 0.082, 0.092, 0.310, 0.185, 0.436, 0.357, 0.472, 0.631, 0.502)
One thing to note at this point is that, because we have so few observations, we can’t really withhold any to cross-validate the glmnet regression. That’s a shame, because it means we can’t easily optimize the alpha parameter (which determines whether glmnet behaves as lasso, elastic net, or ridge; alpha = 1 is pure lasso and alpha = 0 is pure ridge) as was done here. Instead we’ll use alpha = 0.9, which keeps the analysis a little bit elastic-netty but still leans toward the more interpretable lasso.
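Let’s go ahead and set alpha now so it’s defined when we call glmnet below. For reference, if you did have enough data to tune alpha, one approach (just a sketch here, not something I’ve run on this dataset, and it requires glmnet to be loaded first) would be to compare the minimum cross-validated error across a grid of alpha values, using the same fold assignments for each:

alpha <- 0.9

## A sketch of how alpha could be tuned with more data (not run here).
## A fixed foldid keeps the cross-validation folds identical for every alpha.
# alphas <- seq(0, 1, by = 0.1)
# folds <- sample(rep(1:10, length.out = length(response)))
# cv.error <- sapply(alphas, function(a){
#   min(cv.glmnet(data.matrix(env.params), response, family = 'gaussian',
#                 alpha = a, foldid = folds)$cvm)
# })
# alpha <- alphas[which.min(cv.error)]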
In my real analysis I was using glmnet to identify predictors for lots of response variables. It was therefore useful to write a function to generalize the several commands needed for glmnet. Here’s the function:
library(glmnet)

## The function requires a matrix of possible predictors, a vector with the response variable,
## the GLM family used for the model (e.g. 'gaussian'), the alpha parameter, and "type.measure".
## See the documentation on cv.glmnet for options. Probably you want "deviance".

get.glmnet <- function(predictors, response, family, alpha, type.measure){

  glmnet.out <- glmnet(predictors, response, family = family, alpha = alpha)
  glmnet.cv <- cv.glmnet(predictors, response, family = family, alpha = alpha, type.measure = type.measure)

  ## Find the value of lambda that minimizes the mean cross-validated error.
  lambda.lim <- glmnet.cv$lambda[which.min(glmnet.cv$cvm)]

  ## Identify the coefficients that correspond to the lambda minimum. The coefficients come back
  ## as a sparse matrix, and the @i slot is zero-indexed, so add 1 to get R's 1-based row
  ## positions (row 1 is the intercept).
  temp.coefficients.i <- coefficients(glmnet.out, s = lambda.lim)@i + 1

  ## And the parameter names...
  temp.coefficients.names <- coefficients(glmnet.out, s = lambda.lim)@Dimnames[[1]][temp.coefficients.i]
  temp.coefficients <- coefficients(glmnet.out, s = lambda.lim)@x

  ## Package for export.
  temp.coefficients <- rbind(temp.coefficients.names, temp.coefficients)
  return(temp.coefficients)
}
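One small aside: cv.glmnet already stores the error-minimizing lambda in its output, so I believe the lambda.lim line above could be replaced with the equivalent, and slightly more readable:

lambda.lim <- glmnet.cv$lambda.min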
Phew! Okay, let’s try to do something with this. Note that glmnet requires a matrix as input, not a dataframe.
response.predictors <- get.glmnet(data.matrix(env.params), response, 'gaussian', alpha, 'deviance')
> t(response.predictors)
temp.coefficients.names temp.coefficients
[1,] "(Intercept)" "0.496410195525042"
[2,] "ao.Apr" "0.009282064516813"
[3,] "ao.current" "-0.0214919836174853"
[4,] "pdo.Mar" "-0.0568728879266135"
[5,] "pdo.Aug" "-0.00881845994191182"
[6,] "pdo.Dec" "-0.0321738320415234"
[7,] "ow.Dec" "1.92231198892721e-06"
[8,] "ow.frac.current" "0.207945851122607"
[9,] "ice.Sep" "-2.29621552264475e-06"
So glmnet retained 8 predictors (plus the intercept). Possible, but I’m a little suspicious. I’d like to see this in the more familiar GLM format so that I can wrap my head around the significance of the variables. Let’s start by building a model with all of the glmnet-selected predictors (I’ll call it null.model, since it’s the starting point for model selection).
null.model <- lm(response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + pdo.Dec + ow.Dec + ow.frac.current + ice.Sep, data = env.params)
> summary(null.model)

Call:
lm(formula = response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug +
    pdo.Dec + ow.Dec + ow.frac.current + ice.Sep, data = env.params)

Residuals:
     Min       1Q   Median       3Q      Max
-0.14304 -0.03352 -0.01679  0.05553  0.13497

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      2.052e+00  1.813e+00   1.132   0.2841
ao.Apr           2.226e-02  3.702e-02   0.601   0.5610
ao.current      -3.643e-02  1.782e-02  -2.044   0.0681 .
pdo.Mar         -7.200e-02  3.215e-02  -2.240   0.0490 *
pdo.Aug         -1.244e-02  2.948e-02  -0.422   0.6821
pdo.Dec         -6.072e-02  3.664e-02  -1.657   0.1285
ow.Dec           3.553e-06  1.749e-06   2.032   0.0696 .
ow.frac.current  1.187e-01  4.807e-01   0.247   0.8100
ice.Sep         -1.023e-05  8.558e-06  -1.195   0.2596
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09043 on 10 degrees of freedom
Multiple R-squared:  0.8579,	Adjusted R-squared:  0.7443
F-statistic: 7.549 on 8 and 10 DF,  p-value: 0.002224
Hmmm… this looks messy. There’s only one predictor with a slope significantly different from 0. A large fraction of the variance in the original data is accounted for, but it smells like overfitting to me. Let’s QC this model a bit by 1) checking for collinearity among the predictors, 2) using AIC and relative likelihood to further eliminate predictors, and 3) selecting the final model with ANOVA.
## Check for collinearity among the predictors using variance inflation factors.
library(car)
> vif(null.model)
         ao.Apr      ao.current         pdo.Mar         pdo.Aug         pdo.Dec          ow.Dec
       1.610203        1.464678        1.958998        2.854688        2.791749        1.772611
ow.frac.current         ice.Sep
       1.744653        1.510852
Surprisingly, all of the parameters have acceptable VIF scores (VIF < 5). Let’s proceed with model selection using AIC and relative likelihood.
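Nothing to fix here, but if any of the VIF scores had come back high, a reasonable next step would be to look at the pairwise correlations among the glmnet-selected predictors to see which ones are redundant. A minimal sketch with base R:

## Pairwise correlations among the glmnet-selected predictors; large absolute
## values would point to the variables driving any high VIF scores.
cor(env.params[, c('ao.Apr', 'ao.current', 'pdo.Mar', 'pdo.Aug', 'pdo.Dec',
                   'ow.Dec', 'ow.frac.current', 'ice.Sep')])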
## Define a function to evaluate relative likelihood.
rl <- function(aicmin, aici){
  return(exp((aicmin - aici) / 2))
}

## Construct multiple possible models, adding parameters in order of abs(t value) in summary(null.model).
model.1 <- lm(response ~ pdo.Mar, data = env.params)
model.2 <- lm(response ~ pdo.Mar + ao.current, data = env.params)
model.3 <- lm(response ~ pdo.Mar + ao.current + ow.Dec, data = env.params)
model.4 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec, data = env.params)
model.5 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep, data = env.params)
model.6 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr, data = env.params)
model.7 <- lm(response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr + pdo.Aug, data = env.params)

## Collect AIC scores for the models.
model.null.aic <- AIC(null.model)
model.1.aic <- AIC(model.1)
model.2.aic <- AIC(model.2)
model.3.aic <- AIC(model.3)
model.4.aic <- AIC(model.4)
model.5.aic <- AIC(model.5)
model.6.aic <- AIC(model.6)
model.7.aic <- AIC(model.7)

## Identify the model with the lowest AIC score.
> which.min(c(model.1.aic, model.2.aic, model.3.aic, model.4.aic, model.5.aic, model.6.aic, model.7.aic, model.null.aic))
[1] 5
So model.5 has the lowest AIC. Next we check the relative likelihood that each of the other models minimizes information loss relative to model.5. Models with a relative likelihood < 0.05 are unlikely to minimize information loss and can be discarded; for example, a model whose AIC sits 6 units above the minimum has a relative likelihood of exp(-6/2) ≈ 0.05.
> rl(model.5.aic, model.1.aic)
[1] 8.501094e-05
> rl(model.5.aic, model.2.aic)
[1] 0.00143064
> rl(model.5.aic, model.3.aic)
[1] 0.002415304
> rl(model.5.aic, model.4.aic)
[1] 0.7536875
> rl(model.5.aic, model.6.aic)
[1] 0.4747277
> rl(model.5.aic, model.7.aic)
[1] 0.209965
> rl(model.5.aic, model.null.aic)
[1] 0.08183346
Excellent, we can discard quite a few possible models here: model.1, model.2, and model.3. Last, we’ll use ANOVA with a chi-squared test to see whether any of the remaining models are significantly different from model.4, the remaining model with the fewest parameters.
> anova(model.4, model.5, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626
2     13 0.086169  1  0.012457   0.1704

> anova(model.4, model.6, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626
2     12 0.083887  2   0.01474   0.3485

> anova(model.4, model.7, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec + ice.Sep + ao.Apr + pdo.Aug
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626
2     11 0.082276  3   0.01635   0.5347

> anova(model.4, null.model, test = 'Chisq')
Analysis of Variance Table

Model 1: response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec
Model 2: response ~ ao.Apr + ao.current + pdo.Mar + pdo.Aug + pdo.Dec + ow.Dec + ow.frac.current + ice.Sep
  Res.Df      RSS Df Sum of Sq Pr(>Chi)
1     14 0.098626
2     10 0.081778  4  0.016849   0.7247
Okay, there’s a whole lot going on there; the important thing is that none of the models are significantly different from the model with the fewest parameters. There are probably some small gains in the performance of those larger models, but at an increased risk of overfitting. So model.4 is the winner, and we deem pdo.Mar, ao.current, ow.Dec, and pdo.Dec to be the best predictors of the response variable. For those of you who are curious, those are the March index of the Pacific Decadal Oscillation, the index of the Antarctic Oscillation while the cruise was taking place, within-pack ice open water extent in December (the month prior to the cruise), and the Pacific Decadal Oscillation index in December. Let’s take a look at the model:
> summary(model.4)

Call:
lm(formula = response ~ pdo.Mar + ao.current + ow.Dec + pdo.Dec, data = env.params)

Residuals:
      Min        1Q    Median        3Q       Max
-0.177243 -0.037756 -0.003996  0.059606  0.114155

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.586e-02  5.893e-02  -0.609 0.552563
pdo.Mar     -9.526e-02  2.208e-02  -4.315 0.000713 ***
ao.current  -3.127e-02  1.590e-02  -1.967 0.069308 .
ow.Dec       4.516e-06  1.452e-06   3.111 0.007663 **
pdo.Dec     -8.264e-02  2.172e-02  -3.804 0.001936 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08393 on 14 degrees of freedom
Multiple R-squared:  0.8287,	Adjusted R-squared:  0.7797
F-statistic: 16.93 on 4 and 14 DF,  p-value: 2.947e-05
You’ll recall that the full model (null.model) accounted for 74 % of the variance in the response variable (adjusted R-squared); our final model accounts for 78 %, and we’ve dropped several parameters. Not bad! Note, however, that we never could have gotten to this point without the initial screening of all the candidate predictors provided by glmnet.
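If you’d rather pull those numbers from the model objects than read them off the printed summaries, the adjusted R-squared values are stored in the summary objects:

## Adjusted R-squared for the full (glmnet-selected) model and the reduced model.
summary(null.model)$adj.r.squared
summary(model.4)$adj.r.squared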
I can still hear some grumbling about overfitting, however, so let’s try to address that by bootstrapping. We are crazy data-limited with only 19 observations, but let’s iteratively hold three observations back at random, build a model with the remaining 16 (using our identified parameters), and see how well each model predicts the three that were held back. I created a function to do this so that I could repeat the exercise for lots of different response variables.
## The function requires as input the response vector, a vector of the predictor names,
## the data frame of predictors, and the number of iterations you want to run.

bootstrap.model <- function(response.vector, predictors, data, n){

  predictor.df <- as.data.frame(data[,predictors])
  colnames(predictor.df) <- predictors

  model.predictions <- matrix(ncol = 3, nrow = 0)
  colnames(model.predictions) <- c('year', 'prediction', 'real')

  ## How many observations you need to withhold (boot) is determined by how many
  ## iterations you want to run. For example, for 1000 iterations of
  ## unique random subsets of 19 observations, you need to withhold 3.
  ## (In practice the while loop only runs on the first pass, so boot ends up being the
  ## smallest integer such that (length(response.vector) - 1)^boot >= n.)

  x <- 0
  boot <- 0

  for(y in (length(response.vector) - 1):1){
    while(x < n){
      boot <- boot + 1
      x <- y ** boot
      print(x)
    }
  }

  ## Now build n new models.

  for(i in 1:n){
    print(i)

    ## Randomly split the observations into withheld (predict.i) and training (train.i) sets.
    predict.i <- sample.int(length(response.vector), boot)
    train.i <- which(!1:length(response.vector) %in% predict.i)

    temp.model <- lm(response.vector ~ ., data = predictor.df, subset = train.i)

    new.data <- as.data.frame(predictor.df[predict.i,])
    colnames(new.data) <- predictors

    temp.predict <- predict(temp.model, newdata = new.data, type = 'response')
    temp.actual <- response.vector[predict.i]

    temp.out <- matrix(ncol = 3, nrow = boot)
    try({temp.out[,1] <- names(response.vector)[predict.i]}, silent = T)
    temp.out[,2] <- temp.predict
    temp.out[,3] <- temp.actual

    model.predictions <- rbind(model.predictions, temp.out)
  }

  ## Average the predictions for each observed value across all iterations.
  mean.predicted <- as.data.frame(tapply(as.numeric(model.predictions[,2]), as.character(model.predictions[,3]), mean))
  sd.predicted <- as.data.frame(tapply(as.numeric(model.predictions[,2]), as.character(model.predictions[,3]), sd))

  ## Make an awesome plot.
  plot(mean.predicted[,1] ~ as.numeric(row.names(mean.predicted)),
       ylab = 'Predicted response',
       xlab = 'Observed response',
       pch = 19)

  for(r in row.names(mean.predicted)){
    lines(c(as.numeric(r), as.numeric(r)), c(mean.predicted[r,1], mean.predicted[r,1] + sd.predicted[r,1]))
    lines(c(as.numeric(r), as.numeric(r)), c(mean.predicted[r,1], mean.predicted[r,1] - sd.predicted[r,1]))
  }

  abline(0, 1, lty = 2)

  mean.lm <- lm(mean.predicted[,1] ~ as.numeric(row.names(mean.predicted)))
  abline(mean.lm)
  print(summary(mean.lm))

  ## Capture stuff that might be useful later in a list.
  list.out <- list()
  list.out$model.predictions <- model.predictions
  list.out$mean.lm <- mean.lm
  list.out$mean.predicted <- mean.predicted
  list.out$sd.predicted <- sd.predicted

  return(list.out)
}

## And execute the function...
demo.bootstrap <- bootstrap.model(response, c('pdo.Mar', 'ao.current', 'ow.Dec', 'pdo.Dec'), env.params, 1000)
Hey, that looks pretty good! The dotted line is 1:1, while the solid line gives the slope of the regression between predicted and observed values. The lines through each point give the standard deviation of the predictions for that point across all iterations. There’s scatter here, but the model predicts low values for low observations, and high for high, so it’s a start…
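If you’d like a number to go with the plot, the predicted-versus-observed regression is returned as part of the output list, so you can pull the intercept and slope directly (a slope near 1 and an intercept near 0 would indicate the predictions track the observations well):

## Intercept and slope of the predicted ~ observed regression from the bootstrap.
coef(demo.bootstrap$mean.lm)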
The data and approach this tutorial references were published as: Bowman, J.S., Kavanaugh, M.T., Doney, S.C. and Ducklow, H.W., 2018. Recurrent seascape units identify key ecological processes along the western Antarctic Peninsula. Global Change Biology, 24(7), pp. 3065-3078. (https://onlinelibrary.wiley.com/doi/full/10.1111/gcb.14161)