Contents

1 Introduction

The aim of this document is to measure the performance of the HDF5Array package for normalization and PCA of on-disk single cell data, two computationally intensive operations at the core of single cell analysis.

The benchmarks presented in this document were specifically designed to observe the impact of two critical parameters on performance:

  1. data representation (i.e. sparse vs dense);
  2. size of the blocks used for block-processed operations.

Hopefully these benchmarks will also facilitate comparing the performance of single cell analysis workflows based on HDF5Array with that of workflows based on other tools like Seurat or Scanpy.

2 Install and load the required packages

Let’s install and load HDF5Array as well as the other packages used in this vignette:

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)

Load the packages:

library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)

3 The test datasets

3.1 Sparse vs dense representation

The datasets that we will use for our benchmarks are subsets of the 1.3 Million Brain Cell Dataset from 10x Genomics.

The 1.3 Million Brain Cell Dataset is a 27,998 x 1,306,127 matrix of counts with one gene per row and one cell per column. It’s available via the ExperimentHub package in two forms, one that uses a sparse representation and one that uses a dense representation:

hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally provided by TENx Genomics"
hub["EH1040"]$description  # dense representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. Full rectangular, block-compressed format, 1GB block size."

The two datasets are big HDF5 files stored at a remote location. Let’s download them to the local ExperimentHub cache if they are not there yet:

## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while!
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5  <- hub[["EH1040"]]

3.2 TENxMatrix vs HDF5Matrix objects

We use the TENxMatrix() and HDF5Array() constructors to bring the sparse and dense datasets into R as DelayedArray derivatives. Note that this does not load the matrix data into memory.

Bring the sparse dataset into R:

## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")

Note that full_sparse is a 27,998 x 1,306,127 TENxMatrix object:

class(full_sparse)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_sparse)
## [1]   27998 1306127

See ?TENxMatrix in the HDF5Array package for more information about TENxMatrix objects.

Bring the dense dataset into R:

## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")

Note that full_dense is a 27,998 x 1,306,127 HDF5Matrix object that contains the same data as full_sparse:

class(full_dense)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_dense)
## [1]   27998 1306127

See ?HDF5Matrix in the HDF5Array package for more information about HDF5Matrix objects.

Finally, note that the dense HDF5 file does not contain the dimnames of the matrix, so we copy them over from full_sparse to full_dense:

dimnames(full_dense) <- dimnames(full_sparse)
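
If you want to reassure yourself that the two objects really contain the same data, a cheap spot check is to compare a small corner of each matrix; only that tiny block gets loaded in memory. This check is optional and not part of the benchmarks, and the 1:5 and 1:8 ranges below are arbitrary:

## Optional spot check: only a 5 x 8 block of each matrix is realized
## in memory.
m1 <- as.matrix(full_sparse[1:5, 1:8])
m2 <- as.matrix(full_dense[1:5, 1:8])
stopifnot(all(m1 == m2))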

3.3 Create the test datasets

For our benchmarks, we create below five subsets of the 1.3 Million Brain Cell Dataset of increasing sizes: 12,500 cells, 25,000 cells, 50,000 cells, 100,000 cells, and 200,000 cells. Note that subsetting a TENxMatrix or HDF5Matrix object with [ is a delayed operation, so it has virtually no cost:

sparse1 <- full_sparse[ , 1:12500]
dense1  <- full_dense[ , 1:12500]

sparse2 <- full_sparse[ , 1:25000]
dense2  <- full_dense[ , 1:25000]

sparse3 <- full_sparse[ , 1:50000]
dense3  <- full_dense[ , 1:50000]

sparse4 <- full_sparse[ , 1:100000]
dense4  <- full_dense[ , 1:100000]

sparse5 <- full_sparse[ , 1:200000]
dense5  <- full_dense[ , 1:200000]
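
To see that these subsets are indeed lightweight objects carrying delayed operations, and that no data has been read from the HDF5 files yet, you can inspect one of them with showtree() from the DelayedArray package (optional):

class(sparse1)     # the subsetting was registered as a delayed operation
showtree(sparse1)  # displays the delayed operations stacked on top of the HDF5 seed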

4 Block-processed normalization and PCA

4.1 Code used for normalization and PCA

We’ll use the following code for normalization:

## Also does variable gene selection by keeping the 1000 most variable
## genes by default.
simple_normalize <- function(mat, num_variable_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    ## Drop genes that are not expressed in any cell.
    mat <- mat[rowSums(mat) > 0, ]
    ## Scale each cell (column) to a total of 10,000 counts.
    mat <- t(t(mat) * 10000 / colSums(mat))
    ## Keep the most variable genes and log-transform.
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_variable_genes)
    mat <- log1p(mat[variable_idx, ])
    ## Scale each gene (row) to unit standard deviation.
    mat / rowSds(mat)
}

and the following code for PCA:

simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    ## 'Ax' and 'Atx' multiply by the row-centered matrix implicitly,
    ## i.e. without ever materializing 'mat - row_means' in memory.
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}
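
Both helpers only use generic matrix operations (subsetting, colSums(), rowVars(), %*%, etc.), so a quick way to see how they chain together is to run them on a small random count matrix held in memory. The toy data below is made up purely for illustration and plays no role in the benchmarks:

## Toy run on a small random count matrix (made-up data, kept in memory):
set.seed(2009)
toy_counts <- matrix(rpois(3000 * 500, lambda=0.3), nrow=3000,
                     dimnames=list(paste0("gene", 1:3000),
                                   paste0("cell", 1:500)))
toy_norm <- simple_normalize(toy_counts, num_variable_genes=200)
dim(toy_norm)      # 200 x 500 (the 200 most variable genes)
toy_pca <- simple_PCA(toy_norm, k=10)
length(toy_pca$d)  # 10 singular values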

4.2 Block processing and block size

Note that the implementations of simple_normalize() and simple_PCA() are expected to work on any matrix-like object, regardless of its exact type/representation, e.g. an ordinary matrix, a SparseMatrix object from the SparseArray package, a dgCMatrix object from the Matrix package, a DelayedMatrix derivative (TENxMatrix, HDF5Matrix, TileDBMatrix), etc.

However, when the input is a DelayedMatrix object or derivative, it’s important to be aware that:

  • Summarization methods like sum(), colSums(), rowVars(), or rowSds(), and matrix multiplication (%*%), are block-processed operations.

  • The block size is 100 Mb by default. Increasing or decreasing the block size will typically increase or decrease the memory usage of block-processed operations. It will also impact performance, sometimes in unexpected or counter-intuitive ways.

  • The block size can be controlled with DelayedArray::getAutoBlockSize() and DelayedArray::setAutoBlockSize().

For our benchmarks below, we’ll use the following block sizes:

  • normalization of the sparse datasets: 250 Mb
  • normalization of the dense datasets: 40 Mb
  • PCA on the normalized sparse datasets: 40 Mb
  • PCA on the normalized dense datasets: 100 Mb (the default)
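
For reference, this is what querying and changing the block size looks like (sizes are in bytes; the corresponding setAutoBlockSize() calls appear at the top of each benchmark section below):

DelayedArray::getAutoBlockSize()       # current automatic block size, in bytes
DelayedArray::setAutoBlockSize(2.5e8)  # use blocks of 250 Mb
DelayedArray::setAutoBlockSize(1e8)    # back to the 100 Mb default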

4.3 Monitoring memory usage

While manually running our benchmarks below on a Linux system, we will also monitor memory usage at the command line in a terminal with:

(while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where <PID> is the process id of our R session. This will allow us to measure the maximum amount of memory used by the calls to simple_normalize() or simple_PCA().
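
Once a benchmark is done, the maximum amount of memory used can be extracted from ps.log. Here is a minimal sketch in R, assuming the default ps u output layout where the RSS column (resident memory, in KiB) is the 6th field:

## Extract the peak RSS from 'ps.log' (assumes the RSS column, in KiB,
## is the 6th field of the 'ps u' output):
ps_fields <- strsplit(trimws(readLines("ps.log")), "[ \t]+")
rss_kib <- suppressWarnings(as.numeric(sapply(ps_fields, `[`, 6L)))
max(rss_kib, na.rm=TRUE) / 2^20  # max. memory used, in Gb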

5 Normalization benchmarks

In this section we run simple_normalize() on the two smallest test datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on various systems, on all our test datasets, and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

5.1 Normalizing the sparse datasets

DelayedArray::setAutoBlockSize(2.5e8)  # blocks of 250 Mb
## automatic block size set to 2.5e+08 bytes (was 1e+08)

5.1.1 27,998 x 12,500 sparse dataset

dim(sparse1)
## [1] 27998 12500
system.time(sparse1n <- simple_normalize(sparse1))
##    user  system elapsed 
##  41.384   2.820  44.396
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used  (Mb)
## Ncells  9710738 518.7   17631051 941.6         NA  11414772 609.7
## Vcells 25339229 193.4   98950476 755.0     196608 128758432 982.4

Saving the normalized dataset to a temporary file for PCA later:

dim(sparse1n)
sparse1n_path <- tempfile()
writeTENxMatrix(sparse1n, sparse1n_path, group="matrix", level=0)

5.1.2 27,998 x 25,000 sparse dataset

dim(sparse2)
## [1] 27998 25000
system.time(sparse2n <- simple_normalize(sparse2))
##    user  system elapsed 
##  73.006   7.230  80.881
gc()
##            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells  9729464 519.7   17631051  941.6         NA  11414772  609.7
## Vcells 25495458 194.6  171271605 1306.7     196608 210873372 1608.9

Saving the normalized dataset to a temporary file for PCA later:

dim(sparse2n)
sparse2n_path <- tempfile()
writeTENxMatrix(sparse2n, sparse2n_path, group="matrix", level=0)

5.1.3 About memory usage

With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained < 3.7 Gb at all times.

5.2 Normalizing the dense datasets

DelayedArray::setAutoBlockSize(4e7)  # blocks of 40 Mb
## automatic block size set to 4e+07 bytes (was 2.5e+08)

5.2.1 27,998 x 12,500 dense dataset

dim(dense1)
## [1] 27998 12500
system.time(dense1n <- simple_normalize(dense1))
##    user  system elapsed 
##  36.794   4.623  41.568
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9729865 519.7   17631051 941.6         NA  11414772  609.7
## Vcells 25569674 195.1   79009835 602.8     196608 214125843 1633.7

Saving the normalized dataset to a temporary file for PCA:

dim(dense1n)
dense1n_path <- tempfile()
writeHDF5Array(dense1n, dense1n_path, name="normalized_counts", level=0)

5.2.2 27,998 x 25,000 dense dataset

dim(dense2)
## [1] 27998 25000
system.time(dense2n <- simple_normalize(dense2))
##    user  system elapsed 
##  76.230   9.807  86.637
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9732908 519.8   17631051 941.6         NA  11414772  609.7
## Vcells 25665227 195.9   76638892 584.8     196608 214125843 1633.7

Saving the normalized dataset to a temporary file for PCA:

dim(dense2n)
dense2n_path <- tempfile()
writeHDF5Array(dense2n, dense2n_path, name="normalized_counts", level=0)

5.2.3 About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained < 2.8 Gb at all times.

6 PCA benchmarks

In this section we run simple_PCA() on the normalized datasets obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on various systems, on all our test datasets, and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

6.1 PCA on the normalized sparse datasets

DelayedArray::setAutoBlockSize(4e7)  # blocks of 40 Mb
## automatic block size set to 4e+07 bytes (was 4e+07)

6.1.1 1000 x 12,500 sparse dataset

sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
## [1]  1000 12500
system.time(pca1s <- simple_PCA(sparse1n))
##    user  system elapsed 
##  34.079   9.289  49.985
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9739365 520.2   17631051 941.6         NA  11414772  609.7
## Vcells 25962123 198.1   80710084 615.8     196608 214125843 1633.7

6.1.2 1000 x 25,000 sparse dataset

sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
## [1]  1000 25000
system.time(pca2s <- simple_PCA(sparse2n))
##    user  system elapsed 
##  63.919  15.460  79.677
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9739066 520.2   17631051 941.6         NA  11414772  609.7
## Vcells 26562470 202.7   80710084 615.8     196608 214125843 1633.7

6.1.3 About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained < 2.4 Gb at all times.

6.2 PCA on the normalized dense datasets

DelayedArray::setAutoBlockSize(1e8)  # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 4e+07)

6.2.1 1000 x 12,500 dense dataset

dense1n <- HDF5Array(dense1n_path, name="normalized_counts")
dim(dense1n)
## [1]  1000 12500
system.time(pca1d <- simple_PCA(dense1n))
##    user  system elapsed 
##  31.791  14.844  46.904
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9739855 520.2   17631051 941.6         NA  11414772  609.7
## Vcells 26827477 204.7  104317098 795.9     196608 214125843 1633.7

6.2.2 1000 x 25,000 dense dataset

dense2n <- HDF5Array(dense2n_path, name="normalized_counts")
dim(dense2n)
## [1]  1000 25000
system.time(pca2d <- simple_PCA(dense2n))
##    user  system elapsed 
##  56.161  26.476  83.286
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
## Ncells  9739361 520.2   17631051 941.6         NA  11414772  609.7
## Vcells 27388872 209.0  102453644 781.7     196608 214125843 1633.7

6.2.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained < 2.7 Gb at all times.

6.3 Sanity checks

stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))

7 Timings obtained on various systems

Here we report normalization & PCA timings obtained on various systems. For each system, the results are summarized in a table that shows the simple_normalize() and simple_PCA() timings (in seconds) obtained on all our test datasets, using three different block sizes: 40 Mb, 100 Mb, and 250 Mb. In each table, the best time amongst the three block sizes for a given operation and representation is marked with an asterisk (*).

7.1 DELL XPS 15 laptop (model 9520)

  • OS: Linux Ubuntu 24.04
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 32GB
  • Disk: 1TB SSD
  • Disk performance:
      # Output of 'sudo hdparm -Tt <device>':
       Timing cached reads:   35188 MB in  2.00 seconds = 17620.75 MB/sec
       Timing buffered disk reads: 7842 MB in  3.00 seconds = 2613.57 MB/sec
    
object dimensions     sparse (TENxMatrix)                     dense (HDF5Matrix)
(genes x cells)       object       40 Mb   100 Mb   250 Mb    object       40 Mb   100 Mb   250 Mb
---------------------------------------------------------------------------------------------------
Normalization (time in seconds)
27998 x 12500         sparse1         67       50      42*    dense1          51       53      47*
27998 x 25000         sparse2        141      113      85*    dense2         103      112     100*
27998 x 50000         sparse3        275      214     174*    dense3         207*     225     221
27998 x 100000        sparse4        532      440     346*    dense4         416*     458     449
27998 x 200000        sparse5       1088      895     758*    dense5         856*     959     918
PCA (time in seconds)
1000 x 12500          sparse1n        34*      46      47     dense1n         38       42      36*
1000 x 25000          sparse2n        70*      81      76     dense2n         74*      80     125
1000 x 50000          sparse3n       144*     193     167     dense3n        193      179     156*
1000 x 100000         sparse4n       296*     317     337     dense4n        416*     463     425
1000 x 200000         sparse5n       576*     671     685     dense5n       1082      889*    997

Note: “max. mem. used” numbers to be added soon.

7.2 Supermicro SuperServer 1029GQ-TRT

  • OS: Linux Ubuntu 22.04
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 384GB
  • Disk: 1.3TB ATA Disk
  • Disk performance:
      # Output of 'sudo hdparm -Tt <device>':
       Timing cached reads:   20592 MB in  1.99 seconds = 10361.41 MB/sec
       Timing buffered disk reads: 1440 MB in  3.00 seconds = 479.66 MB/sec
    
object dimensions     sparse (TENxMatrix)                     dense (HDF5Matrix)
(genes x cells)       object       40 Mb   100 Mb   250 Mb    object       40 Mb   100 Mb   250 Mb
---------------------------------------------------------------------------------------------------
Normalization (time in seconds)
27998 x 12500         sparse1        119       82      74*    dense1          87       88      83*
27998 x 25000         sparse2        227      194     144*    dense2         177      188     169*
27998 x 50000         sparse3        442      346     297*    dense3         345*     378     352
27998 x 100000        sparse4        855      742     602*    dense4         707*     736     722
27998 x 200000        sparse5       1818     1472    1275*    dense5        1459*    1558    1502
PCA (time in seconds)
1000 x 12500          sparse1n        75*      82      84     dense1n         69*      70      70
1000 x 25000          sparse2n       141      181     131*    dense2n        128*     149     208
1000 x 50000          sparse3n       296      278*     309    dense3n        288      290     277*
1000 x 100000         sparse4n       510*     645     569     dense4n        592      533*    744
1000 x 200000         sparse5n      1146*    1304    1307     dense5n       1727     1510*   1658

Note: “max. mem. used” numbers to be added soon.

7.3 Mac Pro (Apple M2 Ultra)

  • OS: macOS 13.7.1
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 192GB
  • Disk: 2TB SSD
  • Disk performance: N/A

object dimensions     sparse (TENxMatrix)                     dense (HDF5Matrix)
(genes x cells)       object       40 Mb   100 Mb   250 Mb    object       40 Mb   100 Mb   250 Mb
---------------------------------------------------------------------------------------------------
Normalization (time in seconds)
27998 x 12500         sparse1         57       42      34*    dense1          37       35      32*
27998 x 25000         sparse2        119       93      67*    dense2          71       75      68*
27998 x 50000         sparse3        236      182     141*    dense3         141*     148     149
27998 x 100000        sparse4        463      361     281*    dense4         283*     305     305
27998 x 200000        sparse5        949      736     604*    dense5         578*     618     624
PCA (time in seconds)
1000 x 12500          sparse1n        27*      33      34     dense1n         32       30      25*
1000 x 25000          sparse2n        64       65      60*    dense2n         63       56*      95
1000 x 50000          sparse3n       119*     150     133     dense3n        152      133     111*
1000 x 100000         sparse4n       230*     254     269     dense4n        301*     376     321
1000 x 200000         sparse5n       436*     665     528     dense5n        825      714*    782

Note: “max. mem. used” numbers to be added soon.

8 Final remarks

The Supermicro SuperServer 1029GQ-TRT machine is significantly slower than the other machines. This is most likely due to its slower disk.

For PCA, choosing the sparse representation (TENxMatrix) and using small block sizes (40 Mb) is a clear winner.

For normalization, there’s no clear winner between the sparse and dense representations. More precisely, for this operation, the sparse representation tends to give the best times when using bigger blocks (250 Mb), whereas the dense representation tends to give the best times when using smaller blocks (40 Mb), and the two combinations (“sparse with big blocks” vs “dense with small blocks”) end up being roughly comparable in speed. Maybe extending the benchmarks to include more extreme block sizes (e.g. 20 Mb and 500 Mb) could help break the tie.

Normalization and PCA times grow roughly linearly with the size of the dataset, regardless of representation (sparse or dense) or block size. For example, on the DELL XPS 15 laptop, normalizing the sparse datasets with 250 Mb blocks takes 42, 85, 174, 346, and 758 seconds as the number of cells doubles from 12,500 to 200,000.

[Needs confirmation] Normalization and PCA both perform at almost constant memory, regardless of representation (sparse or dense).

9 Session information

sessionInfo()
## R Under development (unstable) (2024-11-20 r87352)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.7.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TENxBrainData_1.27.0        SingleCellExperiment_1.29.1
##  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [7] RSpectra_0.16-2             DelayedMatrixStats_1.29.1  
##  [9] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
## [11] BiocFileCache_2.15.1        dbplyr_2.5.0               
## [13] HDF5Array_1.35.6            rhdf5_2.51.2               
## [15] DelayedArray_0.33.3         SparseArray_1.7.4          
## [17] S4Arrays_1.7.1              IRanges_2.41.2             
## [19] abind_1.4-8                 S4Vectors_0.45.2           
## [21] MatrixGenerics_1.19.1       matrixStats_1.5.0          
## [23] BiocGenerics_0.53.3         generics_0.1.3             
## [25] Matrix_1.7-1                BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0          xfun_0.50                bslib_0.8.0             
##  [4] lattice_0.22-6           rhdf5filters_1.19.0      vctrs_0.6.5             
##  [7] tools_4.5.0              curl_6.1.0               tibble_3.2.1            
## [10] AnnotationDbi_1.69.0     RSQLite_2.3.9            blob_1.2.4              
## [13] pkgconfig_2.0.3          sparseMatrixStats_1.19.0 lifecycle_1.0.4         
## [16] GenomeInfoDbData_1.2.13  compiler_4.5.0           Biostrings_2.75.3       
## [19] htmltools_0.5.8.1        sass_0.4.9               yaml_2.3.10             
## [22] pillar_1.10.1            crayon_1.5.3             jquerylib_0.1.4         
## [25] cachem_1.1.0             mime_0.12                tidyselect_1.2.1        
## [28] digest_0.6.37            purrr_1.0.2              dplyr_1.1.4             
## [31] bookdown_0.42            BiocVersion_3.21.1       fastmap_1.2.0           
## [34] grid_4.5.0               cli_3.6.3                magrittr_2.0.3          
## [37] withr_3.0.2              filelock_1.0.3           UCSC.utils_1.3.1        
## [40] rappdirs_0.3.3           bit64_4.6.0-1            rmarkdown_2.29          
## [43] XVector_0.47.2           httr_1.4.7               bit_4.5.0.1             
## [46] png_0.1-8                memoise_2.0.1            evaluate_1.0.3          
## [49] knitr_1.49               rlang_1.1.5              Rcpp_1.0.14             
## [52] glue_1.8.0               DBI_1.2.3                BiocManager_1.30.25     
## [55] jsonlite_1.8.9           R6_2.5.1                 Rhdf5lib_1.29.0