Detecting hidden heterogeneity in single cell RNA-Seq data

Donghyung Lee

May 10th, 2018

The iasva package can be used to detect hidden heterogenity within bulk or single cell sequencing data. To illustrate how to use the iasva package for heterogenity detection, we use real-world single cell RNA sequencing (scRNA-Seq) data obtained from human pancreatic islet samples (Lawlor et. al., 2016).

Load packages

library(irlba) 
library(iasva)
library(sva)
library(Rtsne)
library(pheatmap)
library(corrplot)
library(DescTools)
library(RColorBrewer)
library(SummarizedExperiment)
set.seed(100)
color.vec <- brewer.pal(3, "Set1")

Load the islet single cell RNA-Seq data

To illustrate how IA-SVA can be used to detect hidden heterogeneity within a homogenous cell population (i.e., alpha cells), we use read counts of alpha cells from healthy (non-diabetic) subjects (n = 101).

counts_file <- system.file("extdata", "iasva_counts_test.Rds",
                         package = "iasva")
# matrix of read counts where genes are rows, samples are columns
counts <- readRDS(counts_file)
# matrix of sample annotations/metadata
anns_file <- system.file("extdata", "iasva_anns_test.Rds",
                         package = "iasva")
anns <- readRDS(anns_file)

Calculate geometric library size, i.e., library size of log-transfromed read counts.

It is well known that the geometric library size (i.e., library size of log-transfromed read counts) or proportion of expressed genes in each cell explains a very large portion of variability of scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the geometric library size (r ~ 0.9). Here, we calculate the geometric library size vector, which will be used as a known factor in the IA-SVA algorithm.

geo_lib_size <- colSums(log(counts + 1))
barplot(geo_lib_size, xlab = "Cell", ylab = "Geometric Lib Size", las = 2)

lcounts <- log(counts + 1)

# PC1 and Geometric library size correlation
pc1 <- irlba(lcounts - rowMeans(lcounts), 1)$v[, 1]
cor(geo_lib_size, pc1)
## [1] -0.99716

Run IA-SVA

Here, we run IA-SVA using patient_id and geo_lib_size as known factors and identify five hidden factors. SVs are plotted in a pairwise fashion to uncover which SVs can seperate cell types.

set.seed(100)
patient_id <- anns$Patient_ID
mod <- model.matrix(~patient_id + geo_lib_size)
# create a summarizedexperiment class
summ_exp <- SummarizedExperiment(assays = counts)
iasva.res<- iasva(summ_exp, mod[, -1],verbose = FALSE, 
                  permute = FALSE, num.sv = 5)
## IA-SVA running...
## 
## SV 1 Detected!
## 
## SV 2 Detected!
## 
## SV 3 Detected!
## 
## SV 4 Detected!
## 
## SV 5 Detected!
## 
## # of significant surrogate variables: 5
iasva.sv <- iasva.res$sv
plot(iasva.sv[, 1], iasva.sv[, 2], xlab = "SV1", ylab = "SV2")

cell_type <- as.factor(iasva.sv[, 1] > -0.1) 
levels(cell_type) <- c("Cell1", "Cell2")
table(cell_type)
## cell_type
## Cell1 Cell2 
##     6    95
# We identified 6 outlier cells based on SV1 that are marked in red
pairs(iasva.sv, main = "IA-SVA", pch = 21, col = color.vec[cell_type],
      bg = color.vec[cell_type], oma = c(4,4,6,12))
legend("right", levels(cell_type), fill = color.vec, bty = "n")

plot(iasva.sv[, 1:2], main = "IA-SVA", pch = 21, xlab = "SV1", ylab = "SV2",
     col = color.vec[cell_type], bg = color.vec[cell_type])

cor(geo_lib_size, iasva.sv[, 1])
## [1] -0.1469422
corrplot(cor(iasva.sv))

As shown in the above figure, SV1 clearly separates alpha cells into two groups: 6 outlier cells (marked in red) and the rest of the alpha cells (marked in blue). ## Find marker genes for the detected heterogeneity (SV1). Here, using the find_markers() function we find marker genes that are significantly associated with SV1 (multiple testing adjusted p-value < 0.05, default significance cutoff, and R-squared value > 0.3, default R-squared cutoff).

marker.counts <- find_markers(summ_exp, as.matrix(iasva.sv[,1]))
## # of markers (): 33
## total # of unique markers: 33
nrow(marker.counts)
## [1] 33
rownames(marker.counts)
##  [1] "PMEPA1"   "FAM198B"  "FLT1"     "ENG"      "SOX4"     "ITGA5"   
##  [7] "PXDN"     "PRDM1"    "ERG"      "CLIC4"    "A2M"      "PPAP2B"  
## [13] "THBS1"    "CLIC2"    "S100A16"  "STC1"     "ACVRL1"   "COL4A1"  
## [19] "MSN"      "TNFAIP2"  "MMP2"     "SERPINE1" "SPARC"    "SPARCL1" 
## [25] "ESAM"     "KDR"      "CD9"      "CXCR4"    "PODXL"    "PLVAP"   
## [31] "CALD1"    "MMP1"     "ADAMTS4"
anno.col <- data.frame(cell_type = cell_type)
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
##             cell_type
## 4th-C63_S30     Cell2
## 4th-C66_S36     Cell2
## 4th-C18_S31     Cell2
## 4th-C57_S18     Cell1
## 4th-C56_S17     Cell2
## 4th-C68_S41     Cell2
pheatmap(log(marker.counts + 1), show_colnames = FALSE,
         clustering_method = "ward.D2", cutree_cols = 2,
         annotation_col = anno.col)

Run tSNE to detect the hidden heterogeneity.

For comparison purposes, we applied tSNE on read counts of all genes to identify the hidden heterogeneity. We used the Rtsne R package with default settings.

set.seed(100)
tsne.res <- Rtsne(t(lcounts), dims = 2)

plot(tsne.res$Y, main = "tSNE", xlab = "tSNE Dim1", ylab = "tSNE Dim2",
     pch = 21, col = color.vec[cell_type], bg = color.vec[cell_type],
     oma = c(4, 4, 6, 12))
legend("bottomright", levels(cell_type), fill = color.vec, bty = "n")

As shown above, tSNE fails to detect the outlier cells that are identified by IA-SVA when all genes are used. Same color coding is used as above.

Run tSNE post IA-SVA analyses, i.e., run tSNE on marker genes associated with SV1 as detected by IA-SVA.

Here, we apply tSNE on the marker genes for SV1 obtained from IA-SVA

set.seed(100)
tsne.res <- Rtsne(unique(t(log(marker.counts + 1))), dims = 2)
plot(tsne.res$Y, main = "tSNE post IA-SVA", xlab = "tSNE Dim1",
     ylab = "tSNE Dim2", pch = 21, col = color.vec[cell_type],
     bg = color.vec[cell_type], oma = c(4, 4, 6, 12))
legend("bottomright", levels(cell_type), fill = color.vec, bty = "n")

tSNE using SV1 marker genes better seperate these ourlier cells. This analyses suggest that gene selection using IA-SVA combined with tSNE analyses can be a powerful way to detect rare cells introducing variability in the single cell gene expression data.

Using a faster implementation of IA-SVA (fast_iasva)

Here, we run a faster implementation of IA-SVA using the same known factors (patient_id and geo_lib_size) as demonstrated above. This function is useful when working with particularly large datasets.

iasva.res <- fast_iasva(summ_exp, mod[, -1], num.sv = 5)
## fast IA-SVA running...
## 
## SV 1 Detected!
## 
## SV 2 Detected!
## 
## SV 3 Detected!
## 
## SV 4 Detected!
## 
## SV 5 Detected!
## 
## # of obtained surrogate variables: 5

Tuning parameters for IA-SVA

The R-squared thresholds used to identify marker genes (find_markers) can greatly influence the 1) number of marker genes identified and 2) the quality of clustering results. With the study_R2() function, users can visualize how different R-squared thresholds influence both factors.

study_res <- study_R2(summ_exp, iasva.sv)
## # of markers (): 274
## total # of unique markers: 274
## # of markers (): 177
## total # of unique markers: 177
## # of markers (): 123
## total # of unique markers: 123
## # of markers (): 84
## total # of unique markers: 84
## # of markers (): 54
## total # of unique markers: 54
## # of markers (): 39
## total # of unique markers: 39
## # of markers (): 30
## total # of unique markers: 30
## # of markers (): 23
## total # of unique markers: 23
## # of markers (): 7
## total # of unique markers: 7
## # of markers (): 3
## total # of unique markers: 3
## # of markers (): 0
## total # of unique markers: 0

This function produces a plot of the number of genes selected vs. the cluster quality (average silhouette score) for different R-squared values.

Session Info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.36.0 Biobase_2.66.0             
##  [3] GenomicRanges_1.58.0        GenomeInfoDb_1.42.0        
##  [5] IRanges_2.40.0              S4Vectors_0.44.0           
##  [7] BiocGenerics_0.52.0         MatrixGenerics_1.18.0      
##  [9] matrixStats_1.4.1           RColorBrewer_1.1-3         
## [11] DescTools_0.99.57           corrplot_0.95              
## [13] pheatmap_1.0.12             Rtsne_0.17                 
## [15] sva_3.54.0                  BiocParallel_1.40.0        
## [17] genefilter_1.88.0           mgcv_1.9-1                 
## [19] nlme_3.1-166                iasva_1.24.0               
## [21] irlba_2.3.5.1               Matrix_1.7-1               
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               gld_2.6.6               readxl_1.4.3           
##  [4] rlang_1.1.4             e1071_1.7-16            compiler_4.4.1         
##  [7] RSQLite_2.3.7           png_0.1-8               vctrs_0.6.5            
## [10] crayon_1.5.3            fastmap_1.2.0           XVector_0.46.0         
## [13] rmarkdown_2.28          UCSC.utils_1.2.0        bit_4.5.0              
## [16] xfun_0.48               zlibbioc_1.52.0         cachem_1.1.0           
## [19] jsonlite_1.8.9          blob_1.2.4              highr_0.11             
## [22] DelayedArray_0.32.0     parallel_4.4.1          cluster_2.1.6          
## [25] R6_2.5.1                bslib_0.8.0             limma_3.62.0           
## [28] boot_1.3-31             jquerylib_0.1.4         cellranger_1.1.0       
## [31] Rcpp_1.0.13             knitr_1.48              splines_4.4.1          
## [34] rstudioapi_0.17.1       abind_1.4-8             yaml_2.3.10            
## [37] codetools_0.2-20        lattice_0.22-6          withr_3.0.2            
## [40] KEGGREST_1.46.0         evaluate_1.0.1          survival_3.7-0         
## [43] proxy_0.4-27            Biostrings_2.74.0       munsell_0.5.1          
## [46] scales_1.3.0            rootSolve_1.8.2.4       xtable_1.8-4           
## [49] class_7.3-22            glue_1.8.0              lmom_3.2               
## [52] tools_4.4.1             data.table_1.16.2       annotate_1.84.0        
## [55] locfit_1.5-9.10         Exact_3.3               mvtnorm_1.3-1          
## [58] XML_3.99-0.17           grid_4.4.1              AnnotationDbi_1.68.0   
## [61] edgeR_4.4.0             colorspace_2.1-1        GenomeInfoDbData_1.2.13
## [64] cli_3.6.3               expm_1.0-0              S4Arrays_1.6.0         
## [67] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
## [70] SparseArray_1.6.0       farver_2.1.2            memoise_2.0.1          
## [73] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
## [76] statmod_1.5.0           bit64_4.5.2             MASS_7.3-61