
1 Introduction

The aim of this document is to measure the performance of the HDF5Array package for normalization and PCA (Principal Component Analysis) of on-disk single cell data, two computationally intensive operations at the core of single cell analysis.

The benchmarks presented in this document were specifically designed to observe the impact of two critical parameters on performance:

  1. data representation (i.e. sparse vs dense);
  2. size of the blocks used for block-processed operations.

Hopefully these benchmarks will also facilitate comparing the performance of single cell analysis workflows based on HDF5Array with that of workflows based on other tools like Seurat or Scanpy.

2 Install and load the required packages

Let’s install and load HDF5Array as well as the other packages used in this vignette:

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)

Load the packages:

library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)

3 The test datasets

3.1 Sparse vs dense representation

The datasets that we will use for our benchmarks are subsets of the 1.3 Million Brain Cell Dataset from 10x Genomics.

The 1.3 Million Brain Cell Dataset is a 27,998 x 1,306,127 matrix of counts with one gene per row and one cell per column. It’s available via the ExperimentHub package in two forms, one that uses a sparse representation and one that uses a dense representation:

hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally provided by TENx Genomics"
hub["EH1040"]$description  # dense representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. Full rectangular, block-compressed format, 1GB block size."

The two datasets are big HDF5 files stored at a remote location. Let’s download them to the local ExperimentHub cache if they are not there yet:

## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while!
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5  <- hub[["EH1040"]]

3.2 TENxMatrix vs HDF5Matrix objects

We use the TENxMatrix() and HDF5Array() constructors to bring the sparse and dense datasets into R as DelayedArray derivatives. Note that this does not load the matrix data into memory.

Bring the sparse dataset into R:

## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")

Note that full_sparse is a 27,998 x 1,306,127 TENxMatrix object:

class(full_sparse)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_sparse)
## [1]   27998 1306127

See ?TENxMatrix in the HDF5Array package for more information about TENxMatrix objects.

Bring the dense dataset into R:

## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")

Note that full_dense is a 27,998 x 1,306,127 HDF5Matrix object that contains the same data as full_sparse:

class(full_dense)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_dense)
## [1]   27998 1306127

See ?HDF5Matrix in the HDF5Array package for more information about HDF5Matrix objects.

Finally, note that the dense HDF5 file does not contain the dimnames of the matrix, so we manually add them to full_dense:

dimnames(full_dense) <- dimnames(full_sparse)
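
As a quick optional sanity check, we can verify that full_dense now carries the same dimnames as full_sparse:

identical(dimnames(full_dense), dimnames(full_sparse))  # expected to be TRUE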

3.3 Create the test datasets

For our benchmarks, below we create subsets of the 1.3 Million Brain Cell Dataset of increasing sizes: subsets with 12,500 cells, 25,000 cells, 50,000 cells, 100,000 cells, and 200,000 cells. Note that subsetting a TENxMatrix or HDF5Matrix object with [ is a delayed operation, so it has virtually no cost:

sparse1 <- full_sparse[ , 1:12500]
dense1  <- full_dense[ , 1:12500]

sparse2 <- full_sparse[ , 1:25000]
dense2  <- full_dense[ , 1:25000]

sparse3 <- full_sparse[ , 1:50000]
dense3  <- full_dense[ , 1:50000]

sparse4 <- full_sparse[ , 1:100000]
dense4  <- full_dense[ , 1:100000]

sparse5 <- full_sparse[ , 1:200000]
dense5  <- full_dense[ , 1:200000]
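
As an optional check, we can confirm that the subsetting above was recorded as a delayed operation and did not read anything from the HDF5 files:

## 'sparse1' is a DelayedMatrix that carries a delayed subsetting operation
## on top of the original TENxMatrix seed. showtree() (from the DelayedArray
## package) displays the tree of delayed operations.
class(sparse1)
showtree(sparse1)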

4 Block-processed normalization and PCA

4.1 Code used for normalization and PCA

We’ll use the following code for normalization:

## Also does variable gene selection by keeping the 1000 most variable
## genes by default.
simple_normalize <- function(mat, num_variable_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    ## Drop genes with no counts, then scale each cell to 10,000 counts.
    mat <- mat[rowSums(mat) > 0, ]
    mat <- t(t(mat) * 10000 / colSums(mat))
    ## Keep only the most variable genes.
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_variable_genes)
    ## Log-transform, then standardize each gene.
    mat <- log1p(mat[variable_idx, ])
    mat / rowSds(mat)
}

and the following code for PCA:

simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    ## 'Ax' and 'Atx' compute the products of the (implicitly) row-centered
    ## matrix, and of its transpose, with a vector, without ever materializing
    ## the centered matrix.
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}

4.2 Block processing and block size

Note that the implementations of simple_normalize() and simple_PCA() are expected to work on any matrix-like object, regardless of its exact type/representation, e.g. an ordinary matrix, a SparseMatrix object from the SparseArray package, a dgCMatrix object from the Matrix package, a DelayedMatrix derivative (TENxMatrix, HDF5Matrix, TileDBMatrix), etc.
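
For example, here is a small sketch showing simple_normalize() and simple_PCA() at work on an ordinary matrix of counts (the toy matrix below is made-up data used purely for illustration):

## Made-up toy data: a 2,000 x 500 matrix of counts with dimnames.
set.seed(123)
toy_counts <- matrix(rpois(2000 * 500, lambda=0.3), nrow=2000,
                     dimnames=list(paste0("gene", 1:2000),
                                   paste0("cell", 1:500)))
toy_normalized <- simple_normalize(toy_counts, num_variable_genes=50)
dim(toy_normalized)  # 50 x 500
toy_pca <- simple_PCA(toy_normalized, k=10)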

However, when the input is a DelayedMatrix object or derivative, it’s important to be aware that:

  • Summarization methods like sum(), colSums(), rowVars(), or rowSds(), and matrix multiplication (%*%), are block-processed operations.

  • The block size is 100 Mb by default. Increasing or decreasing the block size will typically increase or decrease the memory usage of block-processed operations. It will also impact performance, sometimes in unexpected or counter-intuitive ways.

  • The block size can be controlled with DelayedArray::getAutoBlockSize() and DelayedArray::setAutoBlockSize(); see the short example after this list.
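
For example (block sizes are expressed in bytes):

DelayedArray::getAutoBlockSize()       # 1e+08, i.e. 100 Mb (the default)
DelayedArray::setAutoBlockSize(2.5e8)  # use blocks of 250 Mb from now on
DelayedArray::setAutoBlockSize(1e8)    # restore the default block size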

For our benchmarks below, we’ll use the following block sizes:

  • normalization of the sparse datasets: 250 Mb
  • normalization of the dense datasets: 40 Mb
  • PCA on the normalized sparse datasets: 40 Mb
  • PCA on the normalized dense datasets: 100 Mb (the default)

4.3 Monitoring memory usage

While manually running our benchmarks below on a Linux system, we will also monitor memory usage at the command line in a terminal with:

(while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where <PID> is the process id of our R session. This will allow us to measure the maximum amount of memory used by the calls to simple_normalize() or simple_PCA().
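
Note that the process id of the current R session can be obtained from within R with Sys.getpid(). Once the benchmarks are done, the peak memory usage can be extracted from ps.log, for example with the following sketch (it assumes that the RSS values, in Kb, are in the 6th field of the ps u output):

Sys.getpid()  # the <PID> to use in the command above

ps_fields <- strsplit(trimws(readLines("ps.log")), "\\s+")
rss <- suppressWarnings(as.numeric(vapply(ps_fields, `[`, character(1), 6L)))
max(rss, na.rm=TRUE) / 1024  # peak memory usage of the R process, in Mb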

5 Normalization benchmarks

In this section we run simple_normalize() on the two smallest test datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on various systems, on all our test datasets, and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

5.1 Normalizing the sparse datasets

DelayedArray::setAutoBlockSize(2.5e8)  # blocks of 250 Mb
## automatic block size set to 2.5e+08 bytes (was 1e+08)

5.1.1 27,998 x 12,500 sparse dataset

dim(sparse1)
## [1] 27998 12500
system.time(sparse1n <- simple_normalize(sparse1))
##    user  system elapsed 
##  93.727   6.239  99.950
gc()
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  9708960 518.6   17626724 941.4  12985397 693.5
## Vcells 25333290 193.3   98945923 754.9 128752503 982.4

Saving the normalized dataset to a temporary file for PCA later:

dim(sparse1n)
sparse1n_path <- tempfile()
writeTENxMatrix(sparse1n, sparse1n_path, group="matrix", level=0)

5.1.2 27,998 x 25,000 sparse dataset

dim(sparse2)
## [1] 27998 25000
system.time(sparse2n <- simple_normalize(sparse2))
##    user  system elapsed 
## 161.304   8.501 174.097
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  9727686 519.6   17626724  941.4  12985397  693.5
## Vcells 25489507 194.5  171265893 1306.7 210867426 1608.8

Saving the normalized dataset to a temporary file for PCA later:

dim(sparse2n)
sparse2n_path <- tempfile()
writeTENxMatrix(sparse2n, sparse2n_path, group="matrix", level=0)

5.1.3 About memory usage

With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained below 3.7 Gb at all times.

5.2 Normalizing the dense datasets

DelayedArray::setAutoBlockSize(4e7)  # blocks of 40 Mb
## automatic block size set to 4e+07 bytes (was 2.5e+08)

5.2.1 27,998 x 12,500 dense dataset

dim(dense1)
## [1] 27998 12500
system.time(dense1n <- simple_normalize(dense1))
##    user  system elapsed 
##  88.070  14.361 102.441
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9728087 519.6   17626724 941.4  12985397  693.5
## Vcells 25563711 195.1   79005064 602.8 214118677 1633.6

Saving the normalized dataset to a temporary file for PCA:

dim(dense1n)
dense1n_path <- tempfile()
writeHDF5Array(dense1n, dense1n_path, name="normalized_counts", level=0)

5.2.2 27,998 x 25,000 dense dataset

dim(dense2)
## [1] 27998 25000
system.time(dense2n <- simple_normalize(dense2))
##    user  system elapsed 
## 200.185  36.131 236.436
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9731130 519.7   17626724 941.4  12985397  693.5
## Vcells 25659252 195.8   76634112 584.7 214118677 1633.6

Saving the normalized dataset to a temporary file for PCA:

dim(dense2n)
dense2n_path <- tempfile()
writeHDF5Array(dense2n, dense2n_path, name="normalized_counts", level=0)

5.2.3 About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained below 2.8 Gb at all times.

6 PCA benchmarks

In this section we run simple_PCA() on the two normalized datasets obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse and dense), and we report timings and memory usage.

See the Timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on various systems, on all our test datasets, and using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

6.1 PCA on the normalized sparse datasets

DelayedArray::setAutoBlockSize(4e7)  # blocks of 40 Mb
## automatic block size set to 4e+07 bytes (was 4e+07)

6.1.1 1000 x 12,500 sparse dataset

sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
## [1]  1000 12500
system.time(pca1s <- simple_PCA(sparse1n))
##    user  system elapsed 
##  90.429   4.034  88.313
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9737586 520.1   17626724 941.4  12985397  693.5
## Vcells 25956120 198.1   80704072 615.8 214118677 1633.6

6.1.2 1000 x 25,000 sparse dataset

sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
## [1]  1000 25000
system.time(pca2s <- simple_PCA(sparse2n))
##    user  system elapsed 
## 178.476   8.088 176.890
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9737286 520.1   17626724 941.4  12985397  693.5
## Vcells 26556447 202.7   80704072 615.8 214118677 1633.6

6.1.3 About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained below 2.4 Gb at all times.

6.2 PCA on the normalized dense datasets

DelayedArray::setAutoBlockSize(1e8)  # blocks of 100 Mb
## automatic block size set to 1e+08 bytes (was 4e+07)

6.2.1 1000 x 12,500 dense dataset

dense1n <- HDF5Array(dense1n_path, name="normalized_counts")
dim(dense1n)
## [1]  1000 12500
system.time(pca1d <- simple_PCA(dense1n))
##    user  system elapsed 
##  78.138  22.692 101.467
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9738074 520.1   17626724 941.4  12985397  693.5
## Vcells 26821434 204.7  104312264 795.9 214118677 1633.6

6.2.2 1000 x 25,000 dense dataset

dense2n <- HDF5Array(dense2n_path, name="normalized_counts")
dim(dense2n)
## [1]  1000 25000
system.time(pca2d <- simple_PCA(dense2n))
##    user  system elapsed 
## 134.436  42.278 177.417
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9737579 520.1   17626724 941.4  12985397  693.5
## Vcells 27382808 209.0  102448793 781.7 214118677 1633.6

6.2.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>; see Monitoring memory usage above in this document) remained below 2.7 Gb at all times.

6.3 Sanity checks

stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))

7 Timings obtained on various systems

Here we report normalization & PCA timings obtained on various systems. For each system, the results are summarized in a table that shows the simple_normalize() and simple_PCA() timings obtained on all our test datasets using three different block sizes: 40 Mb, 100 Mb, and 250 Mb. For a given operation and dataset, the three block sizes can then be compared to identify the fastest one.

7.1 DELL XPS 15 laptop (model 9520)

  • OS: Linux Ubuntu 24.04
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 32GB
  • Disk: 1TB SSD
  • Disk performance:
      # Output of 'sudo hdparm -Tt <device>':
       Timing cached reads:   35188 MB in  2.00 seconds = 17620.75 MB/sec
       Timing buffered disk reads: 7842 MB in  3.00 seconds = 2613.57 MB/sec
    
Times are in seconds.

                                  sparse (TENxMatrix)             dense (HDF5Matrix)
object dimensions     object      block size (Mb)          object       block size (Mb)
(genes x cells)       name          40    100    250       name           40    100    250

Normalization
27998 x 12500         sparse1       67     50     42       dense1         51     53     47
27998 x 25000         sparse2      141    113     85       dense2        103    112    100
27998 x 50000         sparse3      275    214    174       dense3        207    225    221
27998 x 100000        sparse4      532    440    346       dense4        416    458    449
27998 x 200000        sparse5     1088    895    758       dense5        856    959    918

PCA
1000 x 12500          sparse1n      34     46     47       dense1n        38     42     36
1000 x 25000          sparse2n      70     81     76       dense2n        74     80    125
1000 x 50000          sparse3n     144    193    167       dense3n       193    179    156
1000 x 100000         sparse4n     296    317    337       dense4n       416    463    425
1000 x 200000         sparse5n     576    671    685       dense5n      1082    889    997

Note: the “max. mem. used” measurements are not available yet; they will be added to this table later.

7.2 Supermicro SuperServer 1029GQ-TRT

  • OS: Linux Ubuntu 22.04
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 384GB
  • Disk: 1.3TB ATA Disk
  • Disk performance:
      # Output of 'sudo hdparm -Tt <device>':
       Timing cached reads:   20592 MB in  1.99 seconds = 10361.41 MB/sec
       Timing buffered disk reads: 1440 MB in  3.00 seconds = 479.66 MB/sec
    
Times are in seconds.

                                  sparse (TENxMatrix)             dense (HDF5Matrix)
object dimensions     object      block size (Mb)          object       block size (Mb)
(genes x cells)       name          40    100    250       name           40    100    250

Normalization
27998 x 12500         sparse1      119     82     74       dense1         87     88     83
27998 x 25000         sparse2      227    194    144       dense2        177    188    169
27998 x 50000         sparse3      442    346    297       dense3        345    378    352
27998 x 100000        sparse4      855    742    602       dense4        707    736    722
27998 x 200000        sparse5     1818   1472   1275       dense5       1459   1558   1502

PCA
1000 x 12500          sparse1n      75     82     84       dense1n        69     70     70
1000 x 25000          sparse2n     141    181    131       dense2n       128    149    208
1000 x 50000          sparse3n     296    278    309       dense3n       288    290    277
1000 x 100000         sparse4n     510    645    569       dense4n       592    533    744
1000 x 200000         sparse5n    1146   1304   1307       dense5n      1727   1510   1658

Note: the “max. mem. used” measurements are not available yet; they will be added to this table later.

7.3 Mac Pro (Apple M2 Ultra)

  • OS: macOS 13.7.1
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 192GB
  • Disk: 2TB SSD
  • Disk performance: N/A

Times are in seconds.

                                  sparse (TENxMatrix)             dense (HDF5Matrix)
object dimensions     object      block size (Mb)          object       block size (Mb)
(genes x cells)       name          40    100    250       name           40    100    250

Normalization
27998 x 12500         sparse1       57     42     34       dense1         37     35     32
27998 x 25000         sparse2      119     93     67       dense2         71     75     68
27998 x 50000         sparse3      236    182    141       dense3        141    148    149
27998 x 100000        sparse4      463    361    281       dense4        283    305    305
27998 x 200000        sparse5      949    736    604       dense5        578    618    624

PCA
1000 x 12500          sparse1n      27     33     34       dense1n        32     30     25
1000 x 25000          sparse2n      64     65     60       dense2n        63     56     95
1000 x 50000          sparse3n     119    150    133       dense3n       152    133    111
1000 x 100000         sparse4n     230    254    269       dense4n       301    376    321
1000 x 200000         sparse5n     436    665    528       dense5n       825    714    782

Note: the “max. mem. used” measurements are not available yet; they will be added to this table later.

8 Final remarks

The Supermicro SuperServer 1029GQ-TRT machine is significantly slower than the other machines. This is most likely due to its slower disk (see the disk performance numbers reported above for each system).

For PCA, choosing the sparse representation (TENxMatrix) and using small block sizes (40 Mb) is a clear winner.

For normalization, there’s no clear best choice between the sparse and dense representations. More precisely, for this operation, the sparse representation tends to give the best times when using bigger blocks (250 Mb), whereas the dense representation tends to give the best times when using smaller blocks (40 Mb), and, based on the above benchmarks, neither “sparse with big blocks” nor “dense with small blocks” clearly wins in terms of speed. Extending the benchmarks to more extreme block sizes (e.g. 20 Mb and 500 Mb) might help break the tie.

Normalization and PCA times grow roughly linearly with the number of cells, regardless of representation (sparse or dense) or block size.

[Needs confirmation] Normalization and PCA both run in almost constant memory, regardless of representation (sparse or dense).

9 Session information

sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TENxBrainData_1.27.0        SingleCellExperiment_1.29.1
##  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [7] RSpectra_0.16-2             DelayedMatrixStats_1.29.1  
##  [9] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
## [11] BiocFileCache_2.15.1        dbplyr_2.5.0               
## [13] HDF5Array_1.35.6            rhdf5_2.51.2               
## [15] DelayedArray_0.33.3         SparseArray_1.7.4          
## [17] S4Arrays_1.7.1              IRanges_2.41.2             
## [19] abind_1.4-8                 S4Vectors_0.45.2           
## [21] MatrixGenerics_1.19.1       matrixStats_1.5.0          
## [23] BiocGenerics_0.53.3         generics_0.1.3             
## [25] Matrix_1.7-1                BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0          xfun_0.50                bslib_0.8.0             
##  [4] lattice_0.22-6           rhdf5filters_1.19.0      vctrs_0.6.5             
##  [7] tools_4.5.0              curl_6.1.0               tibble_3.2.1            
## [10] AnnotationDbi_1.69.0     RSQLite_2.3.9            blob_1.2.4              
## [13] pkgconfig_2.0.3          sparseMatrixStats_1.19.0 lifecycle_1.0.4         
## [16] GenomeInfoDbData_1.2.13  compiler_4.5.0           Biostrings_2.75.3       
## [19] htmltools_0.5.8.1        sass_0.4.9               yaml_2.3.10             
## [22] pillar_1.10.1            crayon_1.5.3             jquerylib_0.1.4         
## [25] cachem_1.1.0             mime_0.12                tidyselect_1.2.1        
## [28] digest_0.6.37            purrr_1.0.2              dplyr_1.1.4             
## [31] bookdown_0.42            BiocVersion_3.21.1       fastmap_1.2.0           
## [34] grid_4.5.0               cli_3.6.3                magrittr_2.0.3          
## [37] withr_3.0.2              filelock_1.0.3           UCSC.utils_1.3.1        
## [40] rappdirs_0.3.3           bit64_4.6.0-1            rmarkdown_2.29          
## [43] XVector_0.47.2           httr_1.4.7               bit_4.5.0.1             
## [46] png_0.1-8                memoise_2.0.1            evaluate_1.0.3          
## [49] knitr_1.49               rlang_1.1.5              Rcpp_1.0.14             
## [52] glue_1.8.0               DBI_1.2.3                BiocManager_1.30.25     
## [55] jsonlite_1.8.9           R6_2.5.1                 Rhdf5lib_1.29.0