orthosData 1.4.0
orthosData is the companion database to the orthos software for mechanistic studies using differential gene expression experiments.
It currently encompasses data for over 100,000 mouse and human differential gene expression experiments, distilled and compiled from the ARCHS4 database (Lachmann et al., 2018) of uniformly processed RNAseq experiments, as well as the associated pre-trained variational models.
Together with orthos, it was developed to provide a better understanding of the effects of experimental treatments on gene expression and to help map treatments to mechanisms of action.
This vignette describes the orthosData functions for retrieval from ExperimentHub and local caching of the models and datasets used internally in orthos.
For more information on using these models and datasets for analysis, please refer to the orthos package documentation.
orthosData is automatically installed as a dependency during orthos installation.
orthosData can also be installed independently from Bioconductor using BiocManager::install():
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("orthosData")
# or, to also install suggested packages:
BiocManager::install("orthosData", dependencies = TRUE)
After installation, the package can be loaded with:
library(orthosData)
GetorthosModels()
GetorthosModels() is the function for local caching of orthosData Keras models for a given organism.
These are the models required to perform inference in orthos.
For each organism they are of three types:
ContextEncoder: The encoder component of a context Variational Autoencoder (VAE). Used to produce a latent encoding of a given gene expression profile (i.e., the context).
Input is a gene expression matrix (shape = M x N, where M is the number of conditions and N is the number of orthos gene features) in the form of log2-transformed library-normalized counts (log2 counts per million, log2CPMs).
Output is an M x d (where d = 64) latent representation of the context.
DeltaEncoder: The encoder component of a contrast conditional Variational Autoencoder (cVAE). Used to produce a latent encoding of a contrast between two conditions (i.e., the delta).
Input is a matrix of gene expression contrasts (shape = M x N) in the form of gene log2 CPM ratios (log2 fold changes, log2FCs), concatenated with the corresponding context encoding.
Output is an M x d (where d = 512) latent representation of the contrast, conditioned on the context.
DeltaDecoder: The decoder component of the same cVAE as above. Used to produce the decoded version of the contrast between two conditions.
Input is the concatenated matrix of the delta and context latent encodings (shape = M x (512 + 64), given the latent sizes stated above).
Output is the decoded contrast matrix (shape = M x N), conditioned on the context.
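The matrix shapes described above can be traced with a small base R sketch. It does not call the Keras models: random matrices stand in for model inputs and outputs, N is a placeholder value rather than the real number of orthos gene features, and the column order used in the concatenations is an assumption made for illustration.

```r
# Hypothetical sizes, for illustration only.
M <- 3L           # number of conditions (rows)
N <- 1000L        # placeholder; the real number of orthos gene features differs
d_context <- 64L  # latent size of the ContextEncoder output (from the text)
d_delta <- 512L   # latent size of the DeltaEncoder output (from the text)

set.seed(1)
context <- matrix(rnorm(M * N), nrow = M, ncol = N)  # log2CPM-like values
delta <- matrix(rnorm(M * N), nrow = M, ncol = N)    # log2FC-like values

# Stand-ins for model outputs: random matrices with the documented shapes.
latent_context <- matrix(rnorm(M * d_context), nrow = M)
latent_delta <- matrix(rnorm(M * d_delta), nrow = M)

# DeltaEncoder input: contrasts concatenated with the context encoding.
# (Column order is an assumption of this sketch.)
delta_encoder_input <- cbind(delta, latent_context)

# DeltaDecoder input: delta and context latent encodings, concatenated.
delta_decoder_input <- cbind(latent_delta, latent_context)

dim(context)              # M x N
dim(delta_encoder_input)  # M x (N + 64)
dim(delta_decoder_input)  # M x (512 + 64)
```

In practice these steps are handled inside orthos, so the sketch is only meant to make the dimensions concrete.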
For more details on model architecture and use of these models in orthos, please refer to the orthos package vignette.
When called for a specific organism, GetorthosModels() downloads the corresponding set of models required for inference from ExperimentHub and caches them in the user ExperimentHub directory (see ExperimentHub::getExperimentHubOption("CACHE")):
# Check the path to the user's ExperimentHub directory:
ExperimentHub::getExperimentHubOption("CACHE")
# Download and cache the models for a specific organism:
GetorthosModels(organism = "Mouse")
GetorthosContrastDB()
The orthosData contrast database contains differential gene expression experiments compiled from the ARCHS4 database of publicly available expression data.
Each entry in the database corresponds to a pair of RNAseq samples contrasting a treatment versus a control condition.
A combination of metadata-semantic and quantitative analyses was used to determine the proper assignment of samples to such pairs in orthosData.
The database includes assays with the original contrasts in the form of gene expression log2 CPM ratios (i.e., log2 fold changes, log2FCs), the decoded and residual components of those contrasts, precalculated using the orthosData models, and the gene expression context of those contrasts in the form of log2-transformed library-normalized counts (i.e., log2 counts per million, log2CPMs).
It also contains extensive annotation on both the orthos feature genes and the contrasted conditions.
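As a rough illustration of the two quantities stored in the assays, the base R sketch below derives log2CPMs and a log2FC contrast from a made-up count matrix. The pseudocount of 1 and the use of the control sample as the context are assumptions of this sketch; the values actually used when compiling the database may differ.

```r
# Toy count matrix: genes in rows, samples in columns (made-up numbers).
counts <- matrix(
  c(10, 0, 90,
    40, 20, 140),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("control", "treatment"))
)

pseudocount <- 1  # assumption for this sketch; orthos may use another value

# Library-size normalization to counts per million, then log2 transform.
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, "/") * 1e6
log2cpm <- log2(cpm + pseudocount)

# The contrast is the log2 ratio of treatment to control, i.e. a difference
# of log2CPM values. Here the control log2CPMs are taken as the context
# (an assumption of this sketch).
log2fc <- log2cpm[, "treatment"] - log2cpm[, "control"]
context <- log2cpm[, "control"]

round(log2fc, 2)
```

Because each library is scaled to one million counts before the log transform, the contrast reflects relative expression changes rather than differences in sequencing depth.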
For each organism, the database has been compiled as an HDF5SummarizedExperiment with an HDF5 component that contains the gene assays and an rds component that contains the gene annotation in the rowData and the contrast annotation in the colData.
Note that because of the way HDF5 datasets and serialized SummarizedExperiments are linked in an HDF5SummarizedExperiment, the two components, although relocatable, must keep the exact filenames used at creation time (see HDF5Array::saveHDF5SummarizedExperiment). In other words, the files can be moved (or copied) to a different directory or machine and will remain functional as long as both live in the same directory and are never renamed.
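The relocation behaviour can be tried out on a toy object without downloading anything. The sketch below assumes the SummarizedExperiment and HDF5Array packages are installed; the assay name, the prefix "toy_" and the temporary directories are hypothetical and unrelated to the files shipped with orthosData.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Toy object standing in for a contrast database (made-up values).
m <- matrix(rnorm(20), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("contrast", 1:4)))
se <- SummarizedExperiment(assays = list(toy_assay = m))

# Save under a hypothetical prefix; this writes one HDF5 file and one rds file.
src <- tempfile("h5se_src_")
dir.create(src)
saveHDF5SummarizedExperiment(se, dir = src, prefix = "toy_")
list.files(src)

# Copy both files, with their names unchanged, to another directory.
dest <- tempfile("h5se_dest_")
dir.create(dest)
file.copy(list.files(src, full.names = TRUE), dest)

# Loading works from the new location as long as the same prefix is used.
se_moved <- loadHDF5SummarizedExperiment(dir = dest, prefix = "toy_")
dim(se_moved)
```

Renaming either copied file, or moving only one of the two, would be expected to break loading, which is why the pair is best handled as a unit.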
When GetorthosContrastDB() is called for a specific organism, the corresponding HDF5SummarizedExperiment is downloaded from ExperimentHub and cached in the user ExperimentHub directory (see ExperimentHub::getExperimentHubOption("CACHE")).
# Check the path to the user's ExperimentHub directory:
ExperimentHub::getExperimentHubOption("CACHE")
# Download and cache the contrast database for a specific organism.
# Note: mode = "DEMO" caches a small "toy" database for demonstration purposes.
# To download the full database use mode = "ANALYSIS" (this takes considerable time and disk space).
GetorthosContrastDB(organism = "Mouse", mode = "DEMO")
# Load the HDF5SummarizedExperiment:
se <- HDF5Array::loadHDF5SummarizedExperiment(
    dir = ExperimentHub::getExperimentHubOption("CACHE"),
    prefix = "mouse_v212_NDF_c100_DEMO")
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] orthosData_1.4.0 knitr_1.48 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.36.0 KEGGREST_1.46.0
#> [3] xfun_0.48 bslib_0.8.0
#> [5] rhdf5_2.50.0 lattice_0.22-6
#> [7] Biobase_2.66.0 rhdf5filters_1.18.0
#> [9] vctrs_0.6.5 tools_4.4.1
#> [11] generics_0.1.3 stats4_4.4.1
#> [13] curl_5.2.3 tibble_3.2.1
#> [15] fansi_1.0.6 AnnotationDbi_1.68.0
#> [17] RSQLite_2.3.7 blob_1.2.4
#> [19] pkgconfig_2.0.3 Matrix_1.7-1
#> [21] dbplyr_2.5.0 S4Vectors_0.44.0
#> [23] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
#> [25] stringr_1.5.1 compiler_4.4.1
#> [27] Biostrings_2.74.0 fontawesome_0.5.2
#> [29] GenomeInfoDb_1.42.0 htmltools_0.5.8.1
#> [31] sass_0.4.9 yaml_2.3.10
#> [33] pillar_1.9.0 crayon_1.5.3
#> [35] jquerylib_0.1.4 DelayedArray_0.32.0
#> [37] cachem_1.1.0 abind_1.4-8
#> [39] ExperimentHub_2.14.0 AnnotationHub_3.14.0
#> [41] tidyselect_1.2.1 digest_0.6.37
#> [43] stringi_1.8.4 dplyr_1.1.4
#> [45] bookdown_0.41 BiocVersion_3.20.0
#> [47] fastmap_1.2.0 grid_4.4.1
#> [49] SparseArray_1.6.0 cli_3.6.3
#> [51] magrittr_2.0.3 S4Arrays_1.6.0
#> [53] utf8_1.2.4 filelock_1.0.3
#> [55] UCSC.utils_1.2.0 rappdirs_0.3.3
#> [57] bit64_4.5.2 rmarkdown_2.28
#> [59] XVector_0.46.0 httr_1.4.7
#> [61] matrixStats_1.4.1 bit_4.5.0
#> [63] png_0.1-8 memoise_2.0.1
#> [65] HDF5Array_1.34.0 evaluate_1.0.1
#> [67] GenomicRanges_1.58.0 IRanges_2.40.0
#> [69] BiocFileCache_2.14.0 rlang_1.1.4
#> [71] glue_1.8.0 DBI_1.2.3
#> [73] BiocManager_1.30.25 BiocGenerics_0.52.0
#> [75] jsonlite_1.8.9 Rhdf5lib_1.28.0
#> [77] R6_2.5.1 MatrixGenerics_1.18.0
#> [79] zlibbioc_1.52.0