lute 1.2.0
User’s Guide for lute
, a framework for bulk transcriptomics deconvolution
experiments. We described this new software and method for cell size rescaling
is in Maden et al. (2024). Below, we provided setup instructions,
introduced essential deconvolution experiment concepts, provided an example
experiment, and linked to resources for further learning.
The lute
package is available from Bioconductor or GitHub. Install and run
lute
using R (v4.3.2 or later) with Bioconductor (v3.19 or later). This
section walks through install and further setup for the conda environment and
Docker image.
Install lute
by passing the following to the console from an active R session.
With the BiocManager
package installed, install lute
from Bioconductor with:
BiocManager::install("lute")
Alternatively, install from GitHub using devtools
:
devtools::install("metamaden/lute")
With lute
successfully installed, run:
library(lute)
Setup instructions for lute
.
From the base directory for the package, create and activate a new environment from the YML script:
conda env create /inst/yml/lute.yml
Running conda info --env
or conda env list
should now show the new
environment. Activate it with conda activate lute
, and install additional
dependencies with /inst/yml/setup.R
. To install lute
, see above.
Access the lute
Docker image with tag madensean758/lute_bioconductor:3.20
:
docker pull madensean758/lute_bioconductor:3.20
docker run madensean758/lute_bioconductor:3.20
If this times out, it may be necessary to log in to Dockerhub with docker login
, check your internet connection, or try again later.
Walks through the definition, experiment elements, simulation assumptions, and rescaling for deconvolution experiments.
Deconvolution is the task of quantifying signal from a signal mixture, which is sometimes called a signal convolution. Deconvolution experiments try to predict signals from their mixtures as accurately and reliably as possible.
In the field of transcriptomics, deconvolution is often applied to gene expression datasets in order to predict cell type quantities from cell mixtures. This technique is applied to bulk tissue specimens of multiple cell types, where accurate cell type quantifications can improve bias adjustments or allow testing of new hypotheses.
Most deconvolution algorithms predict the type-specific proportions from a signature matrix and a convoluted signals matrix in the following manner:
\[P \leftarrow [Y, Z]\] Where we have the following term definitions for bulk transcriptomics deconvolution:
\(P\) : \(K\)-length vector of predicted proportions for the number of cell types \(K\).
\(Z\) : Signature expression matrix, with dimensions of \(G\) total rows of signature genes by \(K\) total columns of cell types.
\(Y\) : The convoluted signals expression matrix, with dimensions of \(G\) total rows of signature genes by \(J\) total columns of bulk samples.
The above terms are shared by reference-based deconvolution algorithms, which are defined by their requirement of a signature matrix \(Z\) to make predictions.
Some additional important properties to consider are the preparation steps used to generate the terms \(Z\) and \(Y\). These steps might include data rescaling and transformation of the expression signals, or certain selection criteria to arrive at the final set of \(G\) signature genes.
Another important representation of deconvolution is in terms of a pseudobulk equation. A pseudobulk is a simulated bulk sample generated using either cell- or type-level reference data. Pseudobulk analysis is a common task for deconvolution experiments.
We may represent the pseudobulked sample \(Y_{PB}\) in terms of the \(Z\) signature matrix and \(P\) cell type proportions, defined as above, as well as some \(K\)-length vector \(S\), which contains cell size estimates used for rescaling \(Z\). This looks like the following:
\[Y_{PB}=Z * P * S\]
Inclusion of the \(S\) vector of cell size scaling factors is an important strategy to correct for potential bias due to systematic differences in sizes between the sizes of predicted cell types. It has been used in deconvolution studies of a variety of tissues, including brain and blood (Monaco et al. (2019), Racle and Gfeller (2020), Sosina et al. (2021)). Proportions predictions not incorporating cell size scale factors assume equal cell sizes between the predicted types, leading to very biased predictions in tissues such as brain that feature principal cell types with very different sizes.
There are dozens of different deconvolution algorithms used in the field of transcriptomics alone. These may be organized in several ways, such as whether they incorporate a variance weighting strategy or cell size scale factors. Another useful way to organize algorithms is by whether they incorporate the non-negative least squares (NNLS) algorithm in some way. This is a statistical approach with the added constraint of assumed non-negativity for inputs, which holds when we consider typical gene expression datasets in the form of counts or log-normalized counts.
The next section shows a hierarchical class structure for accessing multiple deconvolution algorithms, and we can think of this hierarchy as yet another way of organizing and relating different deconvolution algorithms in a useful and actionable manner.
lute
framework for deconvolutionlute
supports deconvolution experiments by coupling convenient management of
standard experiment tasks with standard mappings of common inputs to their
synonyms in supported algorithms for marker selection and deconvolution.
Accessing the framework is as simple as running the lute()
function using
your experiment data (see ?lute
for details). Runnable examples using the
framework function are provided below.
To view a table of supported algorithms and their details, pass the following code to your R session:
info.table <- luteSupportedDeconvolutionAlgorithms()[,seq(5)]
knitr::kable(info.table)
diagram | method_shortname | method_fullname | strict_method_used | package |
---|---|---|---|---|
1 | nnls | non-negative least squares | nnls | nnls |
2 | music | MUlti-Subject SIngle Cell deconvolution | nnls | MuSiC |
5 | music2 | MUlti-Subject SIngle Cell deconvolution 2 | nnls | MuSiC |
5 | music2 | MUlti-Subject SIngle Cell deconvolution 2 | nnls | MuSiC2 |
3 | EPIC | EPIC | EPIC | |
4 | DeconRNASeq | DeconRNASeq | DeconRNASeq | |
6 | Bisque | Bisque | nnls | BisqueRNA |
8 | SCDC | nnls | SCDC |
deconvolutionParam
class hierarchyBriefly, lute
defines a new class hierarchy that organizes deconvolution
functions. This was based on the approach uesd by the bluster
R/Bioconductor
package for clustering algorithms.
Classes in this hierarchy each have a method defined for the deconvolution()
generic function (see ?deconvolution
for details). Methods for higher-level
parent classes like referencebasedParam
will perform data summary and marker
genes comparisons. For the algorithm-specific classes, such as nnlsParam
or
musicParam
, the method maps standard inputs like \(Y\), \(Z\), and \(S\) to their
algorithm-specific synonyms.
The deconvolutionParam
class hierarchy for supported algorithms is visualized
in the following diagram:
The deconvolutionParam
class hierarchy is intended to be extensible to new
algorithms. A future vignette will provide a step-by-step guide to supporting
a new algorithms using these classes and their methods.
This section features several runnable examples of deconvolution experiments
using the lute()
framework function.
We may begin with an object of type SingleCellExperiment
(a.k.a. “sce” object)
containing cell-level expression data. We may use the main lute()
framework
function to perform a deconvolution experiment with these data.
For demonstration, we generate a randomized sce object with
randomSingleCellExperiment()
(see ?randomSingleCellExperiment
for details).
Set the \(G\) marker genes to be 10, or 5 per cell type, which we will identify
from an initial set of 100 genes:
markersPerType <- 5
totalGenes <- 100
exampleSingleCellExperiment <- randomSingleCellExperiment(
numberGenes=totalGenes)
We could identify marker genes in exampleSingleCellExperiment
now by
running lute
without setting the deconvolution algorithm argument:
markers <- lute(singleCellExperiment=exampleSingleCellExperiment,
markersPerType=markersPerType,
deconvolutionAlgorithm=NULL)
## Parsing marker gene arguments...
## Using meanratiosParam...
## selecting among 100 genes for markers of type: type1...
## Selecting by gene
## selecting among 95 genes for markers of type: type2...
## Selecting by gene
## Filtering singleCellExperiment...
length(markers$typemarkerResults)
## [1] 10
We identified 10 marker genes from the provided data.
To run a more complete experiment, we need to define \(Y\). For this
demonstration, we define \(Y\) as a pseudobulk from the provided
exampleSingleCellExperiment
data using ypb_from_sce()
(see ?ypb_from_sce
for details).
examplePseudobulkExpression <- exampleBulkExpression <- ypb_from_sce(
singleCellExperiment=exampleSingleCellExperiment)
By default, the pseudobulk used all available cell data in
exampleSingleCellExperiment
.
Finally, we run marker selection and deconvolution in succession with lute
as
follows:
experiment.results <- lute(singleCellExperiment=exampleSingleCellExperiment,
bulkExpression=as.matrix(exampleBulkExpression),
typemarkerAlgorithm=NULL)
## Parsing deconvolution arguments...
## Using NNLS...
By default, lute
used the mean ratios algorithm to get markers, then the
NNLS algorithm to get cell type proportion predictions.
We can inspect the predicted cell type proportions \(P\) from experiment.results
:
experiment.results$deconvolution.results
## NULL
This User’s Guide introduced the lute
framework for deconvolution, including
an outline of deconvolution experiment variables, an introduction to the
different deconvolution methods for bulk transcriptomics, and a small runnable
example showing how to access the NNLS algorithm by calling the deconvolution
generic on an object of class nnlsParam
.
If you found this guide useful, you may also find the other vignettes included
in lute
to be helpful. The vignette “Pseudobulk cell size rescaling example”
shows a small example experiment contrasting predictions with and without using
cell scale factors using real human brain single-nucleus RNA-seq data accessed
from the scRNAseq
package.
More background about cell scale factors can be found in the Maden et al. (2023) commentary, and examples of their uses can be read about in Monaco et al. (2019), Racle and Gfeller (2020), and Sosina et al. (2021).
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.5.1 lute_1.2.0
## [3] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0
## [5] Biobase_2.66.0 GenomicRanges_1.58.0
## [7] GenomeInfoDb_1.42.0 IRanges_2.40.0
## [9] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [11] MatrixGenerics_1.18.0 matrixStats_1.4.1
## [13] BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
## [4] fastmap_1.2.0 bluster_1.16.0 digest_0.6.37
## [7] rsvd_1.0.5 lifecycle_1.0.4 cluster_2.1.6
## [10] statmod_1.5.0 magrittr_2.0.3 compiler_4.4.1
## [13] rlang_1.1.4 sass_0.4.9 tools_4.4.1
## [16] igraph_2.1.1 utf8_1.2.4 yaml_2.3.10
## [19] knitr_1.48 S4Arrays_1.6.0 labeling_0.4.3
## [22] dqrng_0.4.1 DelayedArray_0.32.0 abind_1.4-8
## [25] BiocParallel_1.40.0 withr_3.0.2 grid_4.4.1
## [28] fansi_1.0.6 beachmat_2.22.0 colorspace_2.1-1
## [31] edgeR_4.4.0 scales_1.3.0 tinytex_0.53
## [34] cli_3.6.3 rmarkdown_2.28 crayon_1.5.3
## [37] generics_0.1.3 metapod_1.14.0 httr_1.4.7
## [40] scuttle_1.16.0 cachem_1.1.0 zlibbioc_1.52.0
## [43] nnls_1.6 parallel_4.4.1 BiocManager_1.30.25
## [46] XVector_0.46.0 vctrs_0.6.5 Matrix_1.7-1
## [49] jsonlite_1.8.9 bookdown_0.41 BiocSingular_1.22.0
## [52] BiocNeighbors_2.0.0 irlba_2.3.5.1 magick_2.8.5
## [55] locfit_1.5-9.10 limma_3.62.0 jquerylib_0.1.4
## [58] glue_1.8.0 codetools_0.2-20 gtable_0.3.6
## [61] UCSC.utils_1.2.0 ScaledMatrix_1.14.0 munsell_0.5.1
## [64] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
## [67] GenomeInfoDbData_1.2.13 R6_2.5.1 evaluate_1.0.1
## [70] lattice_0.22-6 highr_0.11 scran_1.34.0
## [73] bslib_0.8.0 Rcpp_1.0.13 SparseArray_1.6.0
## [76] xfun_0.48 pkgconfig_2.0.3
Maden, Sean K., Louise A. Huuki-Myers, Sang Ho Kwon, Leonardo Collado-Torres, Kristen R. Maynard, and Stephanie C. Hicks. 2024. “Lute: Estimating the Cell Composition of Heterogeneous Tissue with Varying Cell Sizes Using Gene Expression.” bioRxiv. https://doi.org/10.1101/2024.04.04.588105.
Maden, Sean K., Sang Ho Kwon, Louise A. Huuki-Myers, Leonardo Collado-Torres, Stephanie C. Hicks, and Kristen R. Maynard. 2023. “Challenges and Opportunities to Computationally Deconvolve Heterogeneous Tissue with Varying Cell Sizes Using Single Cell RNA-Sequencing Datasets.” arXiv. https://doi.org/10.48550/arXiv.2305.06501.
Monaco, Gianni, Bernett Lee, Weili Xu, Seri Mustafah, You Yi Hwang, Christophe Carré, Nicolas Burdin, et al. 2019. “RNA-Seq Signatures Normalized by mRNA Abundance Allow Absolute Deconvolution of Human Immune Cell Types.” Cell Reports 26 (6): 1627–1640.e7. https://doi.org/10.1016/j.celrep.2019.01.041.
Racle, Julien, and David Gfeller. 2020. “EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data.” Edited by Sebastian Boegel, Methods in Molecular Biology,, 233–48. https://doi.org/10.1007/978-1-0716-0327-7_17.
Sosina, Olukayode A., Matthew N. Tran, Kristen R. Maynard, Ran Tao, Margaret A. Taub, Keri Martinowich, Stephen A. Semick, et al. 2021. “Strategies for Cellular Deconvolution in Human Brain RNA Sequencing Data,” no. 10:750 (August). https://doi.org/10.12688/f1000research.50858.1.