Note: this is part two of a series on clustering RNAseq data. Check out part one on hierarchical clustering here and part three on fuzzy c-means clustering here.

Clustering gene expression data allows us to identify substructures in the data and identify groups of genes that behave similarly. This method can help us identify genes that share a biological function (co-functional) and genes that are under the same control logic (co-regulated).

The ‘correct’ method for clustering RNAseq data is a matter of perspective; it is the one that allows the researcher to make the most out of her data. Gene expression data is also full of noise which can make clustering tricky when using algorithms optimised for chunkier data. With that in mind it’s good to try several methods and compare them. In part one of this series we went through how to cluster RNAseq data using hierarchical clustering and using tree cutting. Here we will discuss how to use K-means clustering to cluster RNAseq data. We will also review methods to determine the optimum number of clusters.

K-means clustering is fundamentally different from hierarchical clustering in that it is a form of partitional clustering in which the data are divided into K partitions. This requires a priori knowledge about the optimum number of partitions (clusters), thus the classic question ‘How do I determine the correct number of clusters?’, which we cover below. In K-means, the data are seeded with K random datapoints. The data closest to each point are assigned to that ‘cluster’, the mean of those data is calculated, and that mean becomes the new cluster center. The data are then re-assigned to their nearest center and the whole process repeats until the clusters are stable. Still confused? Check out this awesome gif from Andrey Shabalin:
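The loop described above can be sketched in a few lines of base R. This is a toy illustration on made-up 2-D data (not the expression matrix we build below); in practice you would use the built-in kmeans():

```r
# Toy sketch of the K-means loop (hypothetical 2-D data, not the RNAseq matrix):
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),  # one blob near (0,0)
           matrix(rnorm(40, mean = 3), ncol = 2))  # one blob near (3,3)
K <- 2
centers <- x[sample(nrow(x), K), ]                 # seed with K random datapoints
for (iter in 1:100) {
  # assign each point to its nearest center
  d <- as.matrix(dist(rbind(centers, x)))[-(1:K), 1:K]
  assignment <- apply(d, 1, which.min)
  # the mean of each cluster's members becomes the new center
  new_centers <- t(sapply(1:K, function(k)
    colMeans(x[assignment == k, , drop = FALSE])))
  if (all(abs(new_centers - centers) < 1e-8)) break  # centers stable: done
  centers <- new_centers
}
table(assignment)
```

Base R’s kmeans() does exactly this (plus smarter handling of empty clusters and multiple restarts via nstart), which is what we use for the real analysis below.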

Let’s put this into practice. The first thing we need is some test data. Here I’m downloading a summarized experiment from the recount project. It describes MCF-7 cells treated with 10nM 17b-E2 for nine time points (5’, 10’, 20’, 40’, 80’, 160’, 320’, 640’ and 1280’). We then normalize in edgeR:

library(SummarizedExperiment)
library(edgeR)
counts <- assays(rse_gene)$counts
y <- as.matrix(counts)
y <- DGEList(counts = y, group=c(1,2,3,4,5,6,7,8,9,10))
y <- calcNormFactors(y)
z <- cpm(y, normalized.lib.sizes=TRUE)

Should you filter the data?

Gene expression data is notoriously noisy. Since K-means is partitional clustering, it forces every gene into a cluster regardless of how well the data actually fit. Noisy genes can then be assigned to clusters even if their expression pattern isn’t representative of the cluster. For that reason I recommend filtering based on mean expression (drop low counts), variance, fold change, or anything else to get rid of background genes. For this tutorial I am taking the most dynamically expressed genes, those which have the highest variance and expression. I am also only taking the last 6 timepoints since the first 4 weren’t that interesting when I first did this analysis (I wouldn’t do this for a real analysis!!):

#mean/variance calculations
z_var <- apply(z, 1, var)
z_mean <- apply(z, 1, mean)
plot(log2(z_mean), log2(z_var), pch='.')
abline(h=log2(50), col='red')
abline(v=log2(50), col='red')
text(x=13,y=23, labels="variance > 50 &\n mean > 50", col='red')
z <- z[which(z_var > 50 & z_mean > 50), 6:10]

The data we take is in the upper right hand corner.

Before clustering we need to scale the data. This is so that we can identify clusters of genes that share similar expression profiles rather than similar expression levels.

scaledata <- t(scale(t(z))) # Centers and scales data.
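A quick sanity check on a toy matrix (hypothetical values) shows what this row-wise scaling does: every gene ends up with mean 0 and standard deviation 1 across samples, so a lowly- and a highly-expressed gene with the same shape become identical:

```r
# Hypothetical toy matrix: two genes with the same profile at very different levels
x <- rbind(low  = c(1, 2, 3, 2, 1),
           high = c(100, 200, 300, 200, 100))
xs <- t(scale(t(x)))  # center and scale each row (gene)
rowMeans(xs)                          # both rows have mean 0
apply(xs, 1, sd)                      # both rows have sd 1
all.equal(xs["low", ], xs["high", ])  # TRUE: identical profiles after scaling
```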

We can also cluster the samples to identify outliers:

hc <- hclust(as.dist(1-cor(scaledata, method="spearman")), method="complete") # Clusters columns by Spearman correlation.
TreeC = as.dendrogram(hc)
plot(TreeC,
     main = "Sample Clustering",
     ylab = "Height")

As you can see, the samples generally cluster together by timepoint (SRRX0013 - SRRX0016 are enumerated in order of time).

K-Means: How many clusters?

As discussed in the introduction, K-means clustering requires a priori knowledge of the number of clusters. The number of clusters can really impact the classifications, so it’s an important consideration. Check out this example from the shiny gallery to see how increasing the cluster number changes the resultant assignments. Cluster number designation is tricky business; for more info check out this awesome Stack Overflow answer.

There are a few methods for evaluating the optimum number of clusters and we’ll see how they apply to our dataset below:


The first measure uses the sum of squared error (SSE). SSE is defined as the sum of the squared distances between each member of a cluster and its cluster centroid. We repeatedly test an increasing number of clusters and evaluate the SSE. As we increase the number of clusters, the distance between any point and its centroid will shrink since the clusters themselves are smaller. At a certain number of clusters, however, the SSE will not significantly decrease with each new cluster. This is the elbow and suggests a suitable number of clusters:

wss <- (nrow(scaledata)-1)*sum(apply(scaledata,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(scaledata,
                                     centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

So by this measure the optimum is 4 clusters.

Average silhouette width:

The next method is by estimating the optimum number using the average silhouette width. The silhouette value describes how similar a gene is to its own cluster (cohesion) compared to other clusters (separation). A high value indicates that the gene is well placed. So if the average of all of these silhouettes is high then the number of clusters is good.
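Concretely, for a gene i the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster (cohesion) and b(i) is the mean distance to the members of the nearest other cluster (separation). A toy example with made-up 1-D values shows well-separated clusters scoring near 1:

```r
library(cluster)  # for silhouette()
# Hypothetical 1-D data in two obvious groups
x  <- c(1, 1.2, 1.1, 5, 5.2, 5.1)
cl <- c(1, 1, 1, 2, 2, 2)
ss <- silhouette(cl, dist(x))
round(ss[, "sil_width"], 2)  # every point scores near 1
mean(ss[, "sil_width"])      # the average silhouette width we maximize over k
```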

sil <- rep(0, 20)
#repeat k-means for k = 2:20 and extract the average silhouette width:
for(i in 2:20){
  k1to20 <- kmeans(scaledata, centers = i, nstart = 25, iter.max = 20)
  ss <- silhouette(k1to20$cluster, dist(scaledata))
  sil[i] <- mean(ss[, 3])
}

# Plot the  average silhouette width
plot(1:20, sil, type = "b", pch = 19, xlab = "Number of clusters k", ylab="Average silhouette width")
abline(v = which.max(sil), lty = 2)
cat("Average silhouette width optimal number of clusters:", which.max(sil), "\n")
## Average silhouette width optimal number of clusters: 2

So by this measure the optimum number is 2 clusters. In my experience 2 clusters oversimplifies gene expression data as being either globally down or globally up, which isn’t that interesting if we want the more nuanced patterns. Note that 3 clusters has approximately the same value.

Calinski criterion:

The Calinski-Harabasz index is based on the intra- and inter-cluster sums of squares, so we are looking to maximize the index to find well-separated clusters. To do this we repeatedly cluster the data and look at how the genes partition themselves with increasing K:

fit <- cascadeKM(scaledata, 1, 20, iter = 100)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
## Calinski criterion optimal number of clusters: 2

Take a look at the plot on the left. It shows how the genes are partitioned with an increasing number of clusters. You can see how different substructures develop as more clusters are added. On the right we see that the maximum CH index is at 2 clusters, but the indices for 3 and 4 are still quite high, so those are also appropriate.

Gap statistic:

Next up is the gap statistic. The gap statistic compares the log within-cluster sum of squares (discussed above) with its expectation under a null reference distribution. It then chooses the number of clusters where the gap between the log(WSS) and the maximum of the null reference is largest:

gap <- clusGap(scaledata, kmeans, 20, B = 100, verbose = interactive())
plot(gap, main = "Gap statistic")
abline(v=which.max(gap$Tab[,3]), lty = 2)

You can see that 6 clusters is the maximum, but 4, 5, and 7 are within a similar range.

Affinity propagation:

This is a newer clustering method that I must admit is a bit over my head. It uses representatives from the data called ‘exemplars’ to build the clusters in a way that’s similar to partitioning around medoids. It doesn’t require designating a cluster number. You can read more here.

Warning: the heatmap takes forever to render, so I have it commented out here and include the image below:

d.apclus <- apcluster(negDistMat(r=2), scaledata)
cat("affinity propagation optimal number of clusters:", length(d.apclus@clusters), "\n")
## affinity propagation optimal number of clusters: 70
#uncomment the next line for the heatmap:
#heatmap(d.apclus,cexRow=0, cexCol=0)

You can see 4 distinct clusters in the heatmap. The model, however, thinks there are 70 clusters, which is a bit absurd, so I’m not sure about this method.

Hierarchical Clustering:

It’s possible to use hierarchical clustering to provide insight into the optimum number of K clusters. The easiest way to do this is to perform the clustering then plot a heatmap and look for the clusters ‘by eye’.

First the matrix:

#make the correlation matrix (named to avoid masking base R's dist())
gene_cor <- cor(t(scaledata), method="pearson")
#make the tree
hr <- hclust(as.dist(1-gene_cor), method="complete") # Cluster rows by Pearson correlation.

Now we can plot the resulting gene heatmap to see if we can pick out the clusters:

#draw the heatmap (heatmap.2 is from the gplots package)
heatmap.2(scaledata,
          Rowv = as.dendrogram(hr),
          Colv = NA,
          margins = c(2, 2),
          cexCol = 0.7,
          labRow = F,
          labCol = F,
          main = "Heatmap",
          trace = "none")

We can see more or less 4 clusters so this mostly agrees.

It looks like between 3 and 4 clusters is optimal based on the different methods. For RNAseq data it’s certainly valid to examine higher cluster numbers to see more distinct profiles. Beware, though: at some point the profiles will become very redundant and the cluster assignments will be ambiguous. Here we’ll continue with 4.

Clustering the data:

Let’s perform the actual clustering using K=4:

kClust <- kmeans(scaledata, centers=4, nstart = 1000, iter.max = 20)
kClusters <- kClust$cluster

Now we can calculate the cluster ‘cores’ aka centroids:

# function to find the centroid of cluster i
clust.centroid = function(i, dat, clusters) {
  ind = (clusters == i)
  colMeans(dat[ind,])
}
kClustcentroids <- sapply(levels(factor(kClusters)), clust.centroid, scaledata, kClusters)

Plotting the centroids to see how they behave:

#get in long form for plotting
Kmolten <- melt(kClustcentroids)
colnames(Kmolten) <- c('sample','cluster','value')

p1 <- ggplot(Kmolten, aes(x=sample,y=value, group=cluster, colour=as.factor(cluster))) + 
  geom_point() + 
  geom_line() +
  xlab("Time") +
  ylab("Expression") +
  labs(title= "Cluster Expression by Time",color = "Cluster")
p1

So we have some interesting cluster profiles! If you do this analysis and recover cores that have very similar expression consider reducing your K.

An a posteriori means of cluster validation is to correlate the cluster centroids with each other. If the centroids are too similar then they will have a high correlation. If your K number produces clusters with high correlation (say above 0.85) then consider reducing the number of clusters.

Correlate the centroids to see how similar they are:

cor(kClustcentroids)

##            1          2           3           4
## 1  1.0000000  0.3385156 -0.36325083 -0.87660675
## 2  0.3385156  1.0000000 -0.84261807 -0.11389869
## 3 -0.3632508 -0.8426181  1.00000000 -0.06415127
## 4 -0.8766067 -0.1138987 -0.06415127  1.00000000

Our clusters are very distinct with a maximum correlation of 0.33 between cluster 1 and cluster 2.

Using a cluster score to identify core genes:

Calculating a membership score for clusters can help identify the core genes whose expression closely match the core. These genes might play a role in determining the expression of the cluster as a whole. These scores can also be used to a posteriori filter your genes if you want to drop genes that don’t fit well into any cluster.

To calculate the scores for a single cluster, in this case cluster 2, we’ll extract the core data for cluster 2, then subset the scaled data by cluster = 2. Then we’ll calculate the ‘score’ by correlating each gene with the cluster core. We can then plot the results for each gene with the core overlaid:

#Subset the cores molten dataframe so we can plot the core
core2 <- Kmolten[Kmolten$cluster=="2",]

#get cluster 2
K2 <- (scaledata[kClusters==2,])
#calculate the correlation with the core
corscore <- function(x){cor(x,core2$value)}
score <- apply(K2, 1, corscore)
#get the data frame into long format for plotting
K2molten <- melt(K2)
colnames(K2molten) <- c('gene','sample','value')
#add the score
K2molten <- merge(K2molten,score, by.x='gene',by.y='row.names', all.x=T)
colnames(K2molten) <- c('gene','sample','value','score')
#order the dataframe by score
#first create an ordering factor
K2molten$order_factor <- 1:length(K2molten$gene)
#then order by score
K2molten <- K2molten[order(K2molten$score),]
#set the order by setting the factors
K2molten$order_factor <- factor(K2molten$order_factor , levels = K2molten$order_factor)

# Everything on the same plot
p2 <- ggplot(K2molten, aes(x=sample,y=value)) + 
  geom_line(aes(colour=score, group=gene)) +
  scale_colour_gradientn(colours=c('blue1','red2')) +
  #this adds the core 
  geom_line(data=core2, aes(sample,value, group=cluster), color="black",inherit.aes=FALSE) +
  xlab("Time") +
  ylab("Expression") +
  labs(title= "Cluster 2 Expression by Time",color = "Score")
p2

In this plot, genes with a profile close to the core have a score approaching 1 (red) while those with divergent patterns have a score closer to 0 (blue). You can see there is some noise but the genes mostly fit the core profile. If you observe many genes with low scores, consider increasing your K, as they’ve been ‘forced’ into a cluster in which they don’t belong. Too much noise in the data can also lead to low-scoring genes.

Comparing cluster methods:

Let’s see how this compares to hierarchical clustering!

Now if you’ve been following since part one on hierarchical clustering you might remember we can cut the tree and recover a pre-determined number of clusters (rather than cutting the tree based on height). We do that here asking for 4 clusters to compare to the K-means:

#we made the hr object above.
hclustk4 = cutree(hr, k=4) #cut tree to find 4 clusters
TreeR = as.dendrogram(hr)
plot(TreeR,
     leaflab = "none",
     main = "Gene Clustering",
     ylab = "Height")
#this plots the bar below:
colored_bars(hclustk4, TreeR, sort_by_labels_order = T, y_shift=-0.1, rowLabels = c("k=4"),cex.rowLabels=0.7)

Looks reasonable since the clusters re-group the main branches.

Now let’s see how the K-means cluster assignments line up with the hierarchical clustering.

plot(TreeR,
     leaflab = "none",
     main = "Gene Clustering",
     ylab = "Height")
the_bars <- cbind(hclustk4, kClusters)
colored_bars(the_bars, TreeR, sort_by_labels_order = T, y_shift=-0.1, rowLabels = c("Treecut",'K-means'),cex.rowLabels=0.7)

We observe some differences especially along the longer branches but the lower branches correspond well.

One last comparison, and this can be applied to any set of cluster assignments, is to check for cluster membership overlap. This will compare the genes which comprise each cluster in one set of clusters and show us if they also cluster together in a second set. Well-matched clusters between assignments will have many overlapping genes. This is also a good way to identify analogous cluster assignments in two sets of clusters. Here we’ll compare the assignments from hierarchical clustering to the K-means assignments and see how they line up:

#these functions from the WGCNA package are great for this:
hclustk4 <- paste0('H-',hclustk4)
kClusters <- paste0('K-',kClusters)
OT<- overlapTable(hclustk4, kClusters)
#get rid of 0 values...
OT$pTable[OT$pTable == 0] <- 2e-300

textMatrix= paste(signif(OT$countTable, 2), "\n(",
                      signif(OT$pTable, 1), ")", sep= "")
dim(textMatrix)= dim(OT$countTable)
heatmap.2(x= -log(OT$pTable),
          dendrogram = "none",
          Colv =F,
          Rowv = F,
          scale = c("none"),
          cellnote = textMatrix,
          cexRow = 0.8,
          cexCol = 0.8,
          main = "Cluster-Cluster Overlap",
          xlab = "K-means (k=4)",
          ylab = "TreeCut (k=4)")

You can see there’s near-perfect agreement between K-3 and H-4, then there is some discord in how the two algorithms assigned the other two clusters. Nonetheless the global groupings are still comparable.

All in all, choosing the optimal number of clusters is more of an art than an exact science, especially for RNAseq data. It really comes down to what you want to see in the data: global profiles or small bunches of co-expressed genes being the two extremes. The main takeaways here are:

  1. Use multiple algorithms and compare them.
  2. Use the strategies above (SSE, CH index, Gap statistic) to guide the analysis then:
  3. Compare results from multiple K numbers.
  4. If your profiles become too redundant when you plot the centroids, use a smaller K.
  5. If when plotting the profiles of individual genes you see outliers that don’t fit, use a larger K.
  6. Consider filtering the data based on mean and variance to exclude noise.

Hope this helps! Check out part three on fuzzy c-means clustering here.

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10 (Yosemite)
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## other attached packages:
##  [1] gplots_3.0.1               dendextend_1.5.2          
##  [3] reshape_0.8.6              ggplot2_2.2.1             
##  [5] apcluster_1.4.3            vegan_2.4-3               
##  [7] lattice_0.20-35            permute_0.9-4             
##  [9] cluster_2.0.6              SummarizedExperiment_1.2.3
## [11] Biobase_2.32.0             GenomicRanges_1.24.3      
## [13] GenomeInfoDb_1.8.7         IRanges_2.6.1             
## [15] S4Vectors_0.10.3           BiocGenerics_0.18.0       
## [17] edgeR_3.14.0               limma_3.28.21             
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       modeltools_0.2-21  kernlab_0.9-25    
##  [4] colorspace_1.3-2   htmltools_0.3.6    viridisLite_0.2.0 
##  [7] yaml_2.1.14        mgcv_1.8-17        rlang_0.1.2       
## [10] prabclus_2.2-6     fpc_2.1-10         plyr_1.8.4        
## [13] robustbase_0.92-7  stringr_1.2.0      zlibbioc_1.18.0   
## [16] munsell_0.4.3      gtable_0.2.0       caTools_1.17.1    
## [19] mvtnorm_1.0-6      evaluate_0.10      labeling_0.3      
## [22] knitr_1.16         flexmix_2.3-14     class_7.3-14      
## [25] DEoptimR_1.0-8     trimcluster_0.1-2  Rcpp_0.12.12      
## [28] KernSmooth_2.23-15 scales_0.4.1       backports_1.1.0   
## [31] diptest_0.75-7     gdata_2.18.0       XVector_0.12.1    
## [34] gridExtra_2.2.1    digest_0.6.12      stringi_1.1.5     
## [37] grid_3.3.0         rprojroot_1.2      bitops_1.0-6      
## [40] tools_3.3.0        magrittr_1.5       lazyeval_0.2.0    
## [43] tibble_1.3.3       whisker_0.3-2      MASS_7.3-47       
## [46] Matrix_1.2-10      rmarkdown_1.6      viridis_0.4.0     
## [49] mclust_5.3         nnet_7.3-12        nlme_3.1-131