Tutorial: Self Organizing Maps in R

Self-organizing maps (SOMs) are a form of neural network and a wonderful way to partition complex data.  In our lab they’re a routine part of our flow cytometry and sequence analysis workflows, but we use them for all kinds of environmental data (like this).  All of the mainstream data analysis languages (R, Python, Matlab) have packages for training and working with SOMs.  My favorite is the R package kohonen, which is simple to use but supports some fairly complex analysis through SOMs with multiple data layers and supervised learning (superSOMs).  The kohonen package has a nice, very accessible paper that describes its key features, and some excellent examples.  This tutorial applies our basic workflow for a single-layer SOM to RGB color data.  RGB color space segmentation is a popular way to evaluate machine learning algorithms, as it is intrinsically multivariate and inherently meaningful.  If like colors group together, you know that you’ve set things up correctly!

This application of SOMs has two steps.  Each of these steps can be thought of as an independent data reduction step.  It’s important to remember that you’re not reducing dimensions per se, as you would in a PCA; you’re aggregating like data so that you can describe them as convenient units (instead of n individual observations).  The final outcome, however, represents a reduction in dimensionality to a single parameter for all observations (e.g., the color blue instead of (0, 0, 255) in RGB colorspace).  The first step – training the SOM – assigns your observations to map units.  The second step – clustering the map units into classes – finds the structure underlying the values associated with the map units after training.  At the end of this procedure each observation belongs to a map unit, and each map unit belongs to a class.  Thus each observation inherits the class of its associated map unit.  If that’s not clear don’t sweat it; the short sketch below and the full procedure should make it concrete.
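Here’s a minimal sketch of that bookkeeping using hypothetical toy assignments (six observations, three map units, two classes); the real versions of these vectors are constructed later in the tutorial:

#### sketch: observations inherit the class of their map unit ####

## hypothetical toy assignments, for illustration only

obs.unit <- c(1, 1, 2, 3, 3, 2)     # step 1: each observation -> a map unit
unit.class <- c(1, 1, 2)            # step 2: each map unit -> a class
obs.class <- unit.class[obs.unit]   # each observation inherits its unit's class
obs.class                           # 1 1 1 2 2 1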

First, let’s generate some random RGB data.  This takes the form of a three-column matrix where each row is a pixel (i.e., an observation).

#### generate some RGB data ####

## select the number of random RGB vectors for training data

sample.size <- 10000

## generate dataframe of random RGB vectors

sample.rgb <- as.data.frame(matrix(nrow = sample.size, ncol = 3))
colnames(sample.rgb) <- c('R', 'G', 'B')

sample.rgb$R <- sample(0:255, sample.size, replace = T)
sample.rgb$G <- sample(0:255, sample.size, replace = T)
sample.rgb$B <- sample(0:255, sample.size, replace = T)

Next, we define a map space for the SOM and train the model.  Picking the right grid size for the map space is non-trivial; you want about 5 elements from the training data per map unit, though you’ll likely find that they’re not uniformly distributed.  Unless you have a very small training dataset, it’s best to use a symmetrical map, hexagonal map units, and a toroidal shape.  The toroidal shape is important to avoid edge effects (a toroid has no edges).
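If you’d rather size the grid from the ~5 observations per map unit rule directly, here’s a simple sketch of that calculation (an alternative to the exponent-based heuristic used in the code below):

## sketch: size a square grid for roughly 5 training observations per map unit
## (an alternative to the exponent-based heuristic used below)

target.per.unit <- 5
ceiling(sqrt(sample.size / target.per.unit)) # 45 when sample.size is 10000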

One important caveat for the RGB data is that we’re not going to bother with any scaling or normalization.  The parameters are all on the same scale and evenly distributed between 0 and the max value of 255.  Your data are likely not so nicely formed!
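If your data do need preprocessing, here’s a minimal sketch using base R’s scale function (the toy columns are hypothetical).  The key point is to keep the centering and scaling constants so that new data can be transformed identically before classification later on:

#### sketch: scaling data that are not on a common scale ####

## hypothetical toy data on very different scales

toy <- data.frame(a = rnorm(100, mean = 50, sd = 10),
                  b = rnorm(100, mean = 0.5, sd = 0.1))

toy.scaled <- scale(toy) # column-wise (x - mean) / sd
toy.center <- attr(toy.scaled, 'scaled:center')
toy.scale <- attr(toy.scaled, 'scaled:scale')

## later, transform new observations with the SAME constants

scale(data.frame(a = 55, b = 0.4), center = toy.center, scale = toy.scale)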

#### train the SOM ####

## define a grid for the SOM and train

library(kohonen)

grid.size <- ceiling(sample.size ^ (1/2.5))
som.grid <- somgrid(xdim = grid.size, ydim = grid.size, topo = 'hexagonal', toroidal = T)
som.model <- som(data.matrix(sample.rgb), grid = som.grid)

Once you’ve trained the SOM it’s a good idea to explore the output of the `som` function to get a feel for the different items in there.  The output takes the form of nested lists.  Here we extract a couple of items that we’ll need later, and also create a distance matrix of the map units.  We can do this because the fundamental purpose of map units is to have a codebook vector that mimics the structure of the training data.  During training each codebook vector is iteratively updated along with its neighbors to match the training data.  After sufficient iterations the codebook vectors reflect the underlying structure of the data.
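For example, here’s a quick look at the top level of the output (field names as of kohonen 3.x; check your installed version):

## explore the structure of the som output

str(som.model, max.level = 1)  # top-level items in the nested list
head(som.model$codes[[1]])     # the codebook vectors
head(som.model$unit.classif)   # winning map unit for each training observation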

## extract some data to make it easier to use

som.events <- som.model$codes[[1]]
som.events.colors <- rgb(som.events[,1], som.events[,2], som.events[,3], maxColorValue = 255)
som.dist <- as.matrix(dist(som.events))

Now that we have a trained SOM let’s generate a descriptive plot.  Since the data are RGB colors, if we color the plot accordingly it should be sensible.  For comparison, we first create a plot with randomized codebook vectors.  This represents the SOM at the start of training.

## generate a plot with randomized codebook vectors.  this isn't really the
## configuration at the first iteration, but serves as an example

plot(som.model,
     type = 'mapping',
     bg = som.events.colors[sample.int(length(som.events.colors), size = length(som.events.colors))],
     keepMargins = F,
     col = NA,
     main = '')

And now the trained SOM:

## generate a plot after training.

plot(som.model,
     type = 'mapping',
     bg = som.events.colors,
     keepMargins = F,
     col = NA,
     main = '')

So pretty! The next step is to cluster the map units into classes.  As with all clustering analyses, a key question is how many clusters (k) we should define.  One way to inform our decision is to evaluate the mean distance between items assigned to the same cluster for many different values of k.  Ideally, a scree plot of mean within-cluster distance vs. k will yield an inflection point that suggests a meaningful value of k.  In practice this inflection point is extremely sensitive to the size of the underlying data (in this case, the number of map units); however, it can be a useful starting place.  Consider that the RGB training data were drawn uniformly at random, so there is no underlying cluster structure, yet we still get an inflection point!

#### look for a reasonable number of clusters ####

## Evaluate within cluster distances for different values of k.  This is
## more dependent on the number of map units in the SOM than the structure
## of the underlying data, but until we have a better way...

## Define a function to calculate the mean distance within each cluster.  This
## is roughly analogous to the within-cluster sum of squares approach

clusterMeanDist <- function(clusters){
  cluster.means = c()
  
  for(c in unique(clusters)){
    temp.members <- which(clusters == c)
    
    if(length(temp.members) > 1){
      temp.dist <- som.dist[temp.members, temp.members]
      cluster.means <- append(cluster.means, mean(temp.dist))
    }else{
      ## a singleton cluster has zero within-cluster distance; append rather
      ## than overwrite (the original code clobbered the whole vector here)
      cluster.means <- append(cluster.means, 0)
    }
  }
  
  return(mean(cluster.means))
  
}

library(vegan) # provides vegdist, used for the hierarchical clustering below

try.k <- 2:100
cluster.dist.eval <- as.data.frame(matrix(ncol = 3, nrow = length(try.k)))
colnames(cluster.dist.eval) <- c('k', 'kmeans', 'hclust')

for(i in 1:length(try.k)) {
  cluster.dist.eval[i, 'k'] <- try.k[i]
  cluster.dist.eval[i, 'kmeans'] <- clusterMeanDist(kmeans(som.events, centers = try.k[i], iter.max = 20)$cluster)
  cluster.dist.eval[i, 'hclust'] <- clusterMeanDist(cutree(hclust(vegdist(som.events)), k = try.k[i]))
}

plot(cluster.dist.eval[, 'kmeans'] ~ try.k,
     type = 'l',
     xlab = 'k',
     ylab = 'Mean within-cluster distance')

lines(cluster.dist.eval[, 'hclust'] ~ try.k,
      col = 'red')

legend('topright',
       legend = c('k-means', 'hierarchical'),
       col = c('black', 'red'),
       lty = c(1, 1))

Having picked a reasonable value for k (let’s say k = 20) we can evaluate different clustering algorithms.  For our data k-means almost always performs best, but you should choose what works best for your data.  Here we’ll evaluate k-means, hierarchical clustering, and model-based clustering (via the pmclust package, which requires a working MPI installation through pbdMPI; see the comments below).  What we’re looking for in the plots is a clustering method that produces contiguous classes.  If classes are spread all across the map, then the clustering algorithm isn’t capturing the structure of the SOM well.

#### evaluate clustering algorithms ####

## Having selected a reasonable value for k, evaluate different clustering algorithms.

library(pmclust)

## Define a function to make a simple plot of clustering output.
## This is the same as the previous plotting, but we define the function
## here because we wanted to play with the colors earlier.

plotSOM <- function(clusters){
  plot(som.model,
       type = 'mapping',
       bg = som.events.colors,
       keepMargins = F,
       col = NA)
  
  add.cluster.boundaries(som.model, clusters)
}

## Try several different clustering algorithms, and, if desired, different values for k

cluster.tries <- list()

for(k in c(20)){
  
  ## model based clustering using pmclust
  
  som.cluster.pm.em <- pmclust(som.events, K = k, algorithm = 'em')$class # model based
  som.cluster.pm.aecm <- pmclust(som.events, K = k, algorithm = 'aecm')$class # model based
  som.cluster.pm.apecm <- pmclust(som.events, K = k, algorithm = 'apecm')$class # model based
  som.cluster.pm.apecma <- pmclust(som.events, K = k, algorithm = 'apecma')$class # model based
  som.cluster.pm.kmeans <- pmclust(som.events, K = k, algorithm = 'kmeans')$class # k-means via pmclust
  
  ## k-means clustering
  
  som.cluster.k <- kmeans(som.events, centers = k, iter.max = 100, nstart = 10)$cluster # k-means
  
  ## hierarchical clustering (use a new name so we don't clobber the som.dist
  ## matrix created earlier)
  
  som.dist.h <- dist(som.events) # hierarchical, step 1
  som.cluster.h <- cutree(hclust(som.dist.h), k = k) # hierarchical, step 2
  
  ## capture outputs
  
  cluster.tries[[paste0('som.cluster.pm.em.', k)]] <- som.cluster.pm.em
  cluster.tries[[paste0('som.cluster.pm.aecm.', k)]] <- som.cluster.pm.aecm
  cluster.tries[[paste0('som.cluster.pm.apecm.', k)]] <- som.cluster.pm.apecm
  cluster.tries[[paste0('som.cluster.pm.apecma.', k)]] <- som.cluster.pm.apecma
  cluster.tries[[paste0('som.cluster.pm.kmeans.', k)]] <- som.cluster.pm.kmeans
  cluster.tries[[paste0('som.cluster.k.', k)]] <- som.cluster.k
  cluster.tries[[paste0('som.cluster.h.', k)]] <- som.cluster.h
}

## Take a look at the various clusters.  You're looking for the algorithm that produces the
## least fragmented clusters.

plotSOM(cluster.tries$som.cluster.pm.em.20)
plotSOM(cluster.tries$som.cluster.pm.aecm.20)
plotSOM(cluster.tries$som.cluster.pm.apecm.20)
plotSOM(cluster.tries$som.cluster.pm.apecma.20)
plotSOM(cluster.tries$som.cluster.k.20)
plotSOM(cluster.tries$som.cluster.h.20)

For brevity I’m not showing the plots produced for all the different clustering algorithms.  For these data the k-means and hierarchical clustering algorithms both look pretty good; I have a slight preference for the k-means version:

The SOM and final map unit clustering represent a classification model that can be saved for use with later data.  One huge advantage to using SOMs over other analysis methods (e.g., ordination techniques) is their usefulness for classifying newly collected data.  New data, if necessary scaled and normalized in the same way as the training data, can be classified by finding the map unit with the minimum distance to each new observation.  To demonstrate this, we’ll generate and classify a small new RGB dataset (in reality classifying in this way is very efficient, and could accommodate a huge number of new observations).  First, we save the SOM and final clustering.

## The SOM and map unit clustering represent a classification model.  These can be saved for
## later use.

som.cluster <- som.cluster.k
som.notes <- c('Clustering based on k-means')

save(file = 'som_model_demo.Rdata', list = c('som.cluster', 'som.notes', 'som.model', 'som.events.colors'))
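In a later R session you could reload everything and classify new observations with the same map call demonstrated below; here’s a minimal sketch (the later.data observations are hypothetical):

## sketch: reuse the saved model in a later session

library(kohonen)
load('som_model_demo.Rdata') # restores som.cluster, som.notes, som.model, som.events.colors

later.data <- matrix(sample(0:255, 30, replace = T), ncol = 3) # hypothetical new observations
later.units <- map(som.model, newdata = later.data)
later.classes <- som.cluster[later.units$unit.classif]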

Then we generate new RGB data, classify it, and make a plot to compare the original data, the color of the winning map unit, and the color of the cluster that map unit belongs to.

#### classification ####

## make a new dataset to classify ##

new.data.size <- 20
new.data <- as.data.frame(matrix(nrow = new.data.size, ncol = 3))
colnames(new.data) <- c('R', 'G', 'B')

new.data$R <- sample(0:255, new.data.size, replace = T)
new.data$G <- sample(0:255, new.data.size, replace = T)
new.data$B <- sample(0:255, new.data.size, replace = T)

## get the closest map unit to each point

new.data.units <- map(som.model, newdata = data.matrix(new.data))

## get the classification for closest map units

new.data.classes <- som.cluster[new.data.units$unit.classif]
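If you want to sanity-check the mapping, the object returned by map holds both the winning unit and the distance from each new observation to that unit’s codebook vector:

## inspect the mapping of new data to map units

str(new.data.units)
head(new.data.units$unit.classif) # winning map unit for each new observation
head(new.data.units$distances)    # distance to the winning codebook vector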

## compare colors of the new data, unit, and class, first define a function
## to calculate the mean colors for each cluster

clusterMeanColor <- function(clusters){
  cluster.means <- c()
  som.codes <- som.model$codes[[1]]
  
  for(c in sort(unique(clusters))){
    temp.members <- which(clusters == c)
    
    if(length(temp.members) > 1){
      temp.codes <- som.codes[temp.members,]
      temp.means <- colMeans(temp.codes)
      temp.col <- rgb(temp.means[1], temp.means[2], temp.means[3], maxColorValue = 255)
    }else{
      temp.codes <- som.codes[temp.members,] # single member, returns a vector
      temp.col <- rgb(temp.codes[1], temp.codes[2], temp.codes[3], maxColorValue = 255)
    }
    cluster.means <- append(cluster.means, temp.col)
  }
  
  return(cluster.means)
  
}

class.colors <- clusterMeanColor(som.cluster)

plot(1:length(new.data$R), rep(1, length(new.data$R)),
     col = rgb(new.data$R, new.data$G, new.data$B, maxColorValue = 255),
     ylim = c(0,4),
     pch = 19,
     cex = 3,
     xlab = 'New data',
     yaxt = 'n',
     ylab = 'Level')

axis(2, at = c(1, 2, 3), labels = c('New data', 'Unit', 'Class'))

points(1:length(new.data.units$unit.classif), rep(2, length(new.data.units$unit.classif)),
     col = som.events.colors[new.data.units$unit.classif],
     pch = 19,
     cex = 3)

points(1:length(new.data.classes), rep(3, length(new.data.classes)),
       col = class.colors[new.data.classes],
       pch = 19,
       cex = 3)

Looks pretty good! Despite defining only 20 classes, the class colors are a reasonable representation of the original data; only slight differences can be observed between the data, the winning map unit, and the class.


19 Responses to Tutorial: Self Organizing Maps in R

  1. Matt says:

    First, super-helpful, and I really appreciate having this page and code to work through, as I am researching this.

    One question on `clusterMeanDist()`… maybe it doesn’t matter, but when computing cluster means with `mean(temp.dist)`, that is taking the mean of a symmetric matrix with a diagonal of zeros. Would it be better to take the means of the unique pairs of distances and not include the zero self-distances? Or does the general trend hold anyway? I was thinking it might matter if the clusters are very imbalanced…

  2. David Covell says:

    Thanks for the tutorial. pmclust has a dependency on pbdMPI. However pbdMPI
    fails to load. Any suggestions?

    > library(pmclust)
    Loading required package: pbdMPI
    Error: package or namespace load failed for 'pbdMPI':
    .onLoad failed in loadNamespace() for 'pbdMPI', details:
    call: inDL(x, as.logical(local), as.logical(now), …)
    error: unable to load shared object 'C:/Users/covelld/Documents/R/win-library/4.0/pbdMPI/libs/x64/pbdMPI.dll':
    LoadLibrary failure: The specified module could not be found.

    Error: package 'pbdMPI' could not be loaded
    In addition: Warning message:
    package 'pmclust' was built under R version 4.0.5
    > install.packages('pbdMPI', dependencies = T)
    Installing package into 'C:/Users/covelld/Documents/R/win-library/4.0'
    (as 'lib' is unspecified)
    trying URL 'https://cloud.r-project.org/bin/windows/contrib/4.0/pbdMPI_0.4-3.zip'
    Content type 'application/zip' length 1043651 bytes (1019 KB)
    downloaded 1019 KB

    package 'pbdMPI' successfully unpacked and MD5 sums checked

    The downloaded binary packages are in
    C:\Users\covelld\AppData\Local\Temp\1\Rtmp8iQt9j\downloaded_packages
    > library(pbdMPI)
    Error: package or namespace load failed for 'pbdMPI':
    .onLoad failed in loadNamespace() for 'pbdMPI', details:
    call: inDL(x, as.logical(local), as.logical(now), …)
    error: unable to load shared object 'C:/Users/covelld/Documents/R/win-library/4.0/pbdMPI/libs/x64/pbdMPI.dll':
    LoadLibrary failure: The specified module could not be found.

    • Jeff says:

      I’d try installing pbdMPI separately. Watch the install closely for warnings and errors.

      • Ananya Ramesh says:

        Thanks for the tutorial, but I am also getting the same error as David. Can you help in fixing it?

        • Jeff says:

          Thanks for the comment Ananya. That’s a package question so I encourage you to reach out to the pmclust and pdbMPI developers if there’s an issue installing those packages.

    • Dominik says:

      Hello,
      as the help site for pbdMPI (?pbdMPI) says, you need some sort of MPI-Library installed. For me, Windows 10, the current version of MS-MPI solved the loading problem and the code runs fine. If your OS differs, then you need to search for a compatible library.

      Cheers

  3. Zk says:

    Thank you so much for this great tutorial.
    I am doing SOM on my data and I need to calculate Topographical error. I can’t seem to understand how to do that.
    Could you please add that?
    Kind regards

    • Jeff says:

      Zk, given time constraints we’re not in a position to customize the tutorial, but if you figure it out I encourage you to post your solution! I’m not familiar with that term though. By topographical error do you mean the distance between observations and the final codebook vector?

  4. pankaj medhi says:

    This tutorial provided me great help with predicting classes for new data.

  5. Sarah Amiri says:

    Thanks Bowman Lab for this tutorial! This helped greatly.

  6. Peter says:

    Dear Jeff
    I really like your example. How can I limit the number of iterations in the SOM? I would like to make an animation that shows how the SOM arranges the different colors in e.g. 100 steps. Therefore, I need to interrupt the iteration process after e.g. 10, 20, 30… steps. Thanks for your insight!

  7. Alice says:

    Hi,
    Your tutorial is helpful.
    Sadly the function “pmclust” has been removed from R, any other suggestions?
    Thanks.

  8. David says:

    install.packages(“vegan”)
    library(vegan)

    Should fix your vegdist problem

  9. Pasquale says:

    Hello.
    Your script gives me an error.
    In your script the function "vegdist" is not defined.
    Could you add it, please?
    Sincerely
