This is the second part of the Bioc2016 workshop “Analysis of single-cell RNA-seq data with R and Bioconductor.”
In this part we will cover cluster analysis with the clusterExperiment package. The package is available on Github and will be in Bioconductor devel soon.
The goal of clusterExperiment
is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithms to find common shared clustering patterns. We implement common post-processing steps unrelated to the specific clustering algorithm (e.g. subsampling the data for stability, finding cluster-specific markers).
We will start from the normalized data obtained in the first part of the workshop with scone. The normalized data can be loaded directly from the workshop package.
data(scone_res)
data(ws_input)
# Summarized Experiment
se <- SummarizedExperiment(list(counts=norm_logcounts),
colData=data.frame(batch=batch[colnames(norm_logcounts)],
time_points=bio[colnames(norm_logcounts)]))
The most common workflow for single-cell clustering found in the literature is to start from some form of normalized gene-level summaries (often on the log-scale), and perform the following steps:
Each of these steps forces the researcher to make some choices, e.g.,
These choices are very likely to impact the results.
pca <- prcomp(t(assay(se)), center=TRUE, scale=TRUE)
plot(pca$x, pch=19, col=bigPalette[bio])
legend("topleft", levels(bio), fill=bigPalette)
res1 <- pam(pca$x[,1:2], k=3)
table(res1$clustering)
##
## 1 2 3
## 35 33 23
res2 <- pam(pca$x[,1:3], k=3)
table(res2$clustering)
##
## 1 2 3
## 33 27 31
plot(pca$sdev^2/sum(pca$sdev^2), xlab="PC", ylab="Percent of explained variance")
res3 <- pam(pca$x[,1:6], k=3)
table(res3$clustering)
##
## 1 2 3
## 48 28 15
The main idea behind clusterExperiment
(clusterExperiment) is to automatically perform and compare several clustering results, based on all possible combinations of parameters, and to find a consensus across the different clusterings.
To repeat this simple example within the clusterExperiment
framework, we can use the function clusterMany
.
cm <- clusterMany(se, dimReduce="PCA", nPCADims=c(2, 3, 6),
ks = 3, clusterFunction = "pam")
cm
## class: ClusterExperiment
## dim: 10610 91
## Primary cluster type: clusterMany
## Primary cluster label: nPCAFeatures=2
## Table of clusters (of primary clustering):
## 1 2 3
## 35 33 23
## Total number of clusterings: 3
## No dendrogram present
## -----------
## Workflow progress:
## clusterMany run? Yes
## combineMany run? No
## makeDendrogram run? No
## mergeClusters run? No
apply(clusterMatrix(cm), 2, table)
## nPCAFeatures=2 nPCAFeatures=3 nPCAFeatures=6
## 1 35 33 48
## 2 33 27 28
## 3 23 31 15
One of the main features of the package is the ease of visualization: For instance, we can directly compare the three results with plotClusters
.
defaultMar <- par("mar")
plotCMar <- c(1.1,8.1,4.1,1.1)
par(mar=plotCMar)
plotClusters(cm)
We can also find a consensus across the different choices.
cm <- combineMany(cm)
## Warning in combineMany(cm): no clusters specified to combine, using results
## from clusterMany
plotClusters(cm)
cm
## class: ClusterExperiment
## dim: 10610 91
## Primary cluster type: combineMany
## Primary cluster label: combineMany
## Table of clusters (of primary clustering):
## -1 c1 c2 c3 c4 c5
## 18 32 18 8 9 6
## Total number of clusterings: 4
## No dendrogram present
## -----------
## Workflow progress:
## clusterMany run? Yes
## combineMany run? Yes
## makeDendrogram run? No
## mergeClusters run? No
Notice that samples are now marked as -1
’s to indicate that these are unclustered samples. In this case, we obtain such samples because of the default option miSize=5
which discards all the clusters with less than 5 samples.
Note that, unlike each individual call to pam
, we do not obtain k = 3
clusters. In general, combineMany
results in a larger number of smaller clusters, that can then be merged with the mergeClusters
function.
The basic premise of our workflow is to find small, robust clusters of samples, and then merge them into larger clusters as relevant. We find that many algorithmic methods for choosing the appropriate number of clusters err on the side of too few clusters. However, we find in practice that we tend to prefer to err on finding many clusters and then merging them based on examining the data.
The main use of the package is to apply the RSEC algorithm to single-cell RNA-seq data.
The idea behind RSEC is to find a large number of small, coherent clusters by:
We perform this routine over many parameters and find a single consensus.
RSEC
workflowThe basic intended clustering workflow is implemented in the RSEC
wrapper function and consists of the following steps (also available as individual functions).
clusterMany
. This results in a large collection of clusterings, where each clustering is based on different parameters.combineMany
function.combineMany
using makeDendrogram
mergeClusters
.Given an underlying clustering strategy, e.g., k-means or PAM with a particular choice of K, we repeat the following:
We can use the function clusterSingle
to perform this strategy over a single set of parameters.
cl3 <- clusterSingle(se, subsample = TRUE, sequential = FALSE,
clusterFunction = c("hierarchical01"),
clusterDArgs = list(minSize=5, alpha=0.3),
subsampleArgs = list("k"=3,"clusterFunction"="pam"),
dimReduce = c("PCA"), ndims=10)
plotCoClustering(cl3)
cl7 <- clusterSingle(se, subsample = TRUE, sequential = FALSE,
clusterFunction = c("hierarchical01"),
clusterDArgs = list(minSize=5, alpha=0.3),
subsampleArgs = list("k"=7,"clusterFunction"="pam"),
dimReduce = c("PCA"), ndims=10)
plotCoClustering(cl7)
Note that the final clustering is performed on the matrix D. This can be done with any clustering method. We use a flexible approach of hierarchical clustering and picking clusters so have at least 1-alpha similarity. Selecting a value for alpha seems more intuitive than selecting a number of desired clusters k.
Our sequential clustering works as follows.
This strategy draws ideas from the “tight clustering” algorithm used to find groups of genes in microarray data in (Tseng and Wong 2005).
Again, one can use the clusterSingle
function to perform sequential clustering.
cl <- clusterSingle(se, subsample = TRUE, sequential = TRUE,
clusterFunction = c("hierarchical01"),
clusterDArgs = list(minSize=5, alpha=0.3),
subsampleArgs = list("clusterFunction"="pam"),
seqArgs = list(k0=5, remain.n=10, top.can=5),
dimReduce = c("PCA"), ndims=10)
## Number of points: 91 Dimension: 10
## Looking for cluster 1 ...
## k = 5
## k = 6
## Cluster 1 found. Cluster size: 6 Remaining number of points: 85
## Looking for cluster 2 ...
## k = 4
## k = 5
## Cluster 2 found. Cluster size: 10 Remaining number of points: 75
## Looking for cluster 3 ...
## k = 3
## k = 4
## Cluster 3 found. Cluster size: 21 Remaining number of points: 54
## Looking for cluster 4 ...
## k = 3
## k = 4
## Did not find 5 clusters: found 4,4 clusters for k= 3,4 , respectively
## Cluster 4 found. Cluster size: 6 Remaining number of points: 48
## Looking for cluster 5 ...
## k = 3
## k = 4
## Did not find 5 clusters: found 4,5 clusters for k= 3,4 , respectively
## Cluster 5 found. Cluster size: 5 Remaining number of points: 43
## Looking for cluster 6 ...
## k = 3
## k = 4
## Did not find 5 clusters: found 3,4 clusters for k= 3,4 , respectively
## Cluster 6 found. Cluster size: 7 Remaining number of points: 36
## Looking for cluster 7 ...
## k = 3
## k = 4
## Did not find 5 clusters: found 4,2 clusters for k= 3,4 , respectively
## k = 5
## Did not find 5 clusters: found 2,2 clusters for k= 4,5 , respectively
## Cluster 7 found. Cluster size: 8 Remaining number of points: 28
## Looking for cluster 8 ...
## k = 3
## k = 4
## Did not find 5 clusters: found 1,1 clusters for k= 3,4 , respectively
## Cluster 8 found. Cluster size: 7 Remaining number of points: 21
## Looking for cluster 9 ...
## k = 3
## k = 4
## Did not find 5 clusters: found 1,1 clusters for k= 3,4 , respectively
## k = 5
## Did not find 5 clusters: found 1,1 clusters for k= 4,5 , respectively
## k = 6
## Found 1,0 clusters for k= 5,6 , respectively. Stopping iterating because zero-length cluster.
## Stopped because: Stopped in midst of searching for cluster 9 because no clusters meeting criteria found for iteration k= 6 and previous clusters not similar enough.
cl
## class: ClusterExperiment
## dim: 10610 91
## Primary cluster type: clusterSingle
## Primary cluster label: clusterSingle
## Table of clusters (of primary clustering):
## -1 1 2 3 4 5 6 7 8
## 21 6 10 21 6 5 7 8 7
## Total number of clusterings: 1
## No dendrogram present
## -----------
## Workflow progress:
## clusterMany run? No
## combineMany run? No
## makeDendrogram run? No
## mergeClusters run? No
The final ingredient is finding a consensus across parameter values. For instance, we can change the values of k0 in the sequential clustering. We could use the function clusterMany
to perform many clusterings and then combineMany
to find the consensus clusters. Instead, the package provides a wrapper to perform these steps, where many of the common parameters have reasonable defaults.
RSEC
functionThe RSEC
function implements the following steps:
clusterMany
.combineMany
.makeDendrogram
and mergeClusters
.rs <- RSEC(se, k0s = 4:15, dimReduce = "PCA", nPCADims=c(10, 20),
alphas=c(0.2, 0.3), clusterFunction="hierarchical01", betas=0.7,
combineProportion=0.5, combineMinSize=3,
dendroReduce = "mad", dendroNDims = 1000,
mergeMethod = "adjP", mergeCutoff=0.01,
seqArgs = list(remain.n=10, top.can=5))
## Note: Merging will be done on ' combineMany ', with clustering index 1
rs
## class: ClusterExperiment
## dim: 10610 91
## Primary cluster type: mergeClusters
## Primary cluster label: mergeClusters
## Table of clusters (of primary clustering):
## -1 m1 m10 m11 m2 m3 m4 m5 m6 m7 m8 m9
## 11 21 3 3 7 14 3 3 7 8 4 7
## Total number of clusterings: 50
## Dendrogram run on 'combineMany' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes
## combineMany run? Yes
## makeDendrogram run? Yes
## mergeClusters run? Yes
There are many arguments to the RSEC function.
Argument | Passed to | Meaning |
---|---|---|
dimReduce | transform | Which dimensionality reduction to perform |
nPCADims | transform | Number of PC’s to be used for the clustering |
clusterFunction | clusterD | Function to use in the clustering of the co-clustering matrix |
alphas | clusterD | 1 - similarity required between the clusters in the co-clustering matrix |
k0s | seqCluster | Initial k values for the sequential strategy |
betas | seqCluster | Stability required across parameters to determine that a cluster is stable |
combineProportion | combineMany | Minimum proportion of times two samples should be together to be assigned to a cluster. |
combineMinSize | combineMany | Clusters with size smaller than this will be ignored (resulting in a -1 label) |
dendroReduce | makeDendrogram | How to reduce the dimension of the data when computing the dendrogram |
dendroNDims | makeDendrogram | How many dimensions to use when computing the dendrogram |
mergeMethod | mergeClusters | Method used to merge clusters |
mergeCutoff | mergeClusters | Cutoff to merge clusters |
The plotClusters
function is a good way to get a sense of how many clusterings we tried and to visualize the consensus across parameters.
par(mar=plotCMar)
plotClusters(rs, main="Clusters from RSEC", axisLine=-1,
sampleData=c("time_points", "batch"))
This plot shows the samples in the columns, and different clusterings on the rows. Each sample is color coded based on its clustering for that row, where the colors have been chosen to try to match up clusters across different clusterings that show large overlap. Moreover, the samples have been ordered so that each subsequent clustering (starting at the top and going down) will try to order the samples to keep the clusters together, without rearranging the clustering blocks of the previous clustering/row.
We can see that some clusters are fairly stable across different choices of dimensions while others can vary dramatically. Notice that some samples are white. This indicates that they have the value -1, meaning they were not clustered.
Another good visualization is the plotCoClustering
function, which shows how many times samples in each cluster are together across parameters.
plotCoClustering(rs)
To retrieve the actual results of each clustering, we can use the clusterMatrix
and primaryClusters
functions.
head(clusterMatrix(rs)[,1:3])
## mergeClusters combineMany nPCAFeatures=10,k0=4,alpha=0.2
## [1,] 1 1 1
## [2,] -1 -1 -1
## [3,] 2 2 2
## [4,] 3 3 3
## [5,] 4 4 4
## [6,] 1 5 5
table(primaryClusterNamed(rs))
##
## -1 m1 m10 m11 m2 m3 m4 m5 m6 m7 m8 m9
## 11 21 3 3 7 14 3 3 7 8 4 7
mergeClusters
It is not uncommon that combineMany
will result in too many small clusters, which in practice are too closely related to be useful. Since our final goal is to find gene markers for each clusters, we argue that we can merge clusters that show no or little differential expression (DE) between them.
This functionality is implemented in the mergeClusters
function. mergeClusters
needs a hierarchical clustering of the clusters; it then goes progressively up that hierarchy, deciding whether two adjacent clusters can be merged. The function makeDendrogram
makes such a hierarchy between clusters (by applying hclust
to the medoids of the clusters).
Here, we use the 1,000 most variable genes to make the cluster hierarchy.
manual <- makeDendrogram(rs, whichCluster = "combineMany", dimReduce="mad", ndims=1000)
plotDendrogram(manual)
It is useful to first run mergeClusters
without actually creating any object so as to preview what the final clustering will be (and perhaps to help in setting the cutoff).
mergeClusters(manual, mergeMethod="adjP", plot="adjP", cutoff=0.01)
## Note: Merging will be done on ' combineMany ', with clustering index 2
manual <- mergeClusters(manual, mergeMethod="adjP", plot="none", cutoff=0.01)
## Note: Merging will be done on ' combineMany ', with clustering index 2
par(mar=plotCMar)
plotClusters(manual, whichClusters = c("mergeClusters", "combineMany"))
plotCoClustering(manual, whichClusters=c("mergeClusters","combineMany"))
Notice that mergeClusters
combines clusters based on the actual values of the features, while the coClustering
plot shows how often the samples clustered together.
getBestFeatures
(using limma)Once we are satisfied with our clustering, the next step is usually to identify marker genes for each of the clusters.
The simplest way is to use differentially expressed (DE) genes to identify such markers. First, we will use limma
as a way to compute DE genes.
When comparing multiple classes (in this case, cell types), the simplest way to identify DE genes is to look for genes DE in at least one class. This can be done using an F-test.
The utility function getBestFeatures
uses the lmFit
and topTable
functions from limma to find such DE genes.
rs <- makeDendrogram(rs, dimReduce="mad", ndims=1000)
## set good breaks for heatmap colors
breaks <- c(min(norm_logcounts), seq(0, quantile(norm_logcounts[norm_logcounts > 0], .99, na.rm = TRUE), length = 50), max(norm_logcounts))
genesF <- getBestFeatures(rs, contrastType="F", number=500, isCount=FALSE)
head(genesF)
## IndexInOriginal Feature X.Intercept. cl2 cl3
## gene17844 9286 gene17844 9.168141 -0.3899279 -0.49110647
## gene6288 3307 gene6288 8.765155 -0.5046839 -0.28491910
## gene9313 4848 gene9313 7.165523 -0.0532788 -0.17263151
## gene10610 5568 gene10610 7.798969 -0.1345563 -0.57220605
## gene12034 6310 gene12034 6.523649 -0.2410174 -0.33209073
## gene12575 6607 gene12575 6.335634 0.3520281 0.03898713
## cl4 cl5 cl6 cl7 cl8
## gene17844 -0.7476203 -0.75258328 -0.2398202 -0.22299406 0.1076454
## gene6288 -0.3931857 -0.25096218 0.1811299 0.29617711 0.1311193
## gene9313 -0.6188552 0.10617452 -0.3630871 0.13054810 -0.2726381
## gene10610 -0.4551548 -0.33472294 -0.2554190 -0.04471818 -0.2051005
## gene12034 -0.6024025 -0.22416686 -0.5314232 0.20657270 -0.5604911
## gene12575 0.2087440 0.04471715 0.5338092 0.40055822 -0.2572012
## cl9 cl10 cl11 AveExpr F P.Value
## gene17844 -0.6885514 -0.54436921 0.1792133 8.879978 1557.694 5.602178e-84
## gene6288 -0.3366467 0.03864092 0.5077338 8.690034 1335.482 1.776294e-81
## gene9313 -0.7611721 0.24803020 0.3677072 7.035566 1222.751 4.804488e-80
## gene10610 -0.5069055 0.06878727 -0.7081651 7.552032 1217.942 5.566981e-80
## gene12034 -0.9758899 0.11446865 -0.3178064 6.266565 1103.627 2.214139e-78
## gene12575 0.4534902 0.09728133 0.5928824 6.522229 1042.196 1.880266e-77
## adj.P.Val
## gene17844 5.943911e-80
## gene6288 9.423240e-78
## gene9313 1.476642e-76
## gene10610 1.476642e-76
## gene12034 4.698402e-75
## gene12575 3.324937e-74
plotHeatmap(rs, clusterSamplesData="dendrogramValue",
clusterFeaturesData=unique(genesF[,"IndexInOriginal"]),
main="F statistics",
breaks=breaks, sampleData=c("time_points"))
The F statistic is not particularly useful to identify markers. The getBestFeatures
function offers three alternative approaches.
Pairs
: finds DE genes corresponding to all pairwise comparisons.OneAgainstAll
: finds DE genes comparing one cluster vs. the average of all the others.Dendro
: uses the cluster hierarchy (from the dendrogram) to compute only important contrasts.genesPairs <- getBestFeatures(rs, contrastType="Pairs", number=50, isCount=FALSE)
plotHeatmap(rs, clusterSamplesData="dendrogramValue",
clusterFeaturesData=unique(genesPairs[,"IndexInOriginal"]),
main="All pairwise comparisons",
breaks=breaks, sampleData=c("time_points"))
genesOneAll <- getBestFeatures(rs, contrastType="OneAgainstAll", number=50, isCount=FALSE)
plotHeatmap(rs, clusterSamplesData="dendrogramValue",
clusterFeaturesData=unique(genesOneAll[,"IndexInOriginal"]),
main="One versus All",
breaks=breaks, sampleData=c("time_points"))