1 Introduction

orthosData is the companion database to the orthos software for mechanistic studies using differential gene expression experiments.

It currently contains data for over 100,000 mouse and human differential gene expression experiments, distilled and compiled from the ARCHS4 database (Lachmann et al., 2018) of uniformly processed RNAseq experiments, as well as the associated pre-trained variational models.

Together with orthos it was developed to provide a better understanding of the effects of experimental treatments on gene expression and to help map treatments to mechanisms of action.

This vignette describes the orthosData functions that retrieve the models and datasets used internally in orthos from ExperimentHub and cache them locally.

For more information on usage of these models and datasets for analysis purposes please refer to the orthos package documentation.

2 Installation and overview

orthosData is automatically installed as a dependency during orthos installation. orthosData can also be installed independently from Bioconductor using BiocManager::install():

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("orthosData")
# Alternatively, also install the suggested packages:
BiocManager::install("orthosData", dependencies = TRUE)

After installation, the package can be loaded with:

library(orthosData)

3 Caching orthosData models with GetorthosModels()

GetorthosModels() is the function for local caching of orthosData Keras models for a given organism.

These are the models required to perform inference in orthos.

For each organism they are of three types:

  • ContextEncoder: The encoder component of a context Variational Autoencoder (VAE). Used to produce a latent encoding of a given gene expression profile (i.e., the context).

    • Input is a gene expression matrix (shape = M x N, where M is the number of conditions and N is the number of orthos gene features) in the form of log2-transformed library-normalized counts (log2 counts per million, log2CPMs).

    • Output is an M x d (where d=64) latent representation of the context.

  • DeltaEncoder: The encoder component of a contrast conditional Variational Autoencoder (cVAE). Used to produce a latent encoding of a contrast between two conditions (i.e., the delta).

    • Input is a matrix of gene expression contrasts (shape= M x N ) in the form of gene log2 CPM ratios (log2 fold changes, log2FCs), concatenated with the corresponding context encoding.

    • Output is an M x d (where d=512) latent representation of the contrast, conditioned on the context.

  • DeltaDecoder: The decoder component of the same cVAE as above. Used to produce the decoded version of the contrast between two conditions.

    • Input is the concatenated matrix (shape= M x (N+d) ) of the delta and context latent encodings.

    • Output is the decoded contrast matrix (shape= M x N), conditioned on the context.
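
The input formats described above can be illustrated with a small base R sketch. It builds a toy count matrix with conditions in rows and genes in columns, converts it to log2CPMs and derives a log2FC contrast. The pseudocount of 1 and all object names are assumptions made for illustration only; the preprocessing that orthos applies internally may differ.

```r
set.seed(1)

# Toy raw count matrix: M = 2 conditions (rows) x N = 5 genes (columns).
counts <- matrix(
    rpois(10, lambda = 50), nrow = 2,
    dimnames = list(c("control", "treatment"), paste0("gene", 1:5))
)

# Assumed pseudocount to avoid log2(0); not necessarily the value used by orthos.
pseudocount <- 1

# Library-size normalization to counts per million (each row sums to 1e6).
lib_size <- rowSums(counts)
cpm <- counts / lib_size * 1e6

# Context: log2-transformed CPMs, shape M x N.
log2cpm <- log2(cpm + pseudocount)

# Contrast: log2 CPM ratio (log2FC) of treatment versus control, shape 1 x N.
log2fc <- log2cpm["treatment", , drop = FALSE] -
    log2cpm["control", , drop = FALSE]

dim(log2cpm)
dim(log2fc)
```

Keeping `drop = FALSE` preserves the matrix shape, so a single contrast is still a 1 x N matrix rather than a plain vector.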

For more details on model architecture and use of these models in orthos please refer to the orthos package vignette.

When called for a specific organism, GetorthosModels() downloads the corresponding set of models required for inference from ExperimentHub and caches them in the user ExperimentHub directory (see ExperimentHub::getExperimentHubOption("CACHE")).

# Check the path to the user's ExperimentHub directory:
ExperimentHub::getExperimentHubOption("CACHE")
# Download and cache the models for a specific organism:
GetorthosModels(organism = "Mouse")

4 Caching the orthosData contrast database with GetorthosContrastDB()

The orthosData contrast database contains differential gene expression experiments compiled from the ARCHS4 database of publicly available expression data. Each entry in the database corresponds to a pair of RNAseq samples contrasting a treatment versus a control condition. A combination of metadata-semantic and quantitative analyses was used to determine the proper assignment of samples to such pairs in orthosData.

The database includes assays with the original contrasts in the form of gene expression log2 CPM ratios (i.e., log2 fold changes, log2FCs), the precalculated decoded and residual components of those contrasts obtained with the orthosData models, and the gene expression context of those contrasts in the form of log2-transformed library-normalized counts (i.e., log2 counts per million, log2CPMs). It also contains extensive annotation for both the orthos feature genes and the contrasted conditions.

For each organism the DB has been compiled as an HDF5SummarizedExperiment with an HDF5 component that contains the gene assays and an rds component that contains gene annotation in the rowData and the contrast annotation in the colData.

Note that because of the way HDF5 datasets and serialized SummarizedExperiments are linked in an HDF5SummarizedExperiment, the two components, although relocatable, must keep exactly the same filenames as those used at creation time (see HDF5Array::saveHDF5SummarizedExperiment). In other words, the files can be moved (or copied) to a different directory or to a different machine and will remain functional as long as both live in the same directory and are never renamed.
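
As a minimal sketch of such a relocation, the base R helper below copies every file whose name starts with a given prefix into a new directory without renaming it. The helper name `relocate_db()`, the temporary directories and the empty placeholder files are hypothetical and only stand in for a real cached database; the two filenames assume the `assays.h5` and `se.rds` suffixes that HDF5Array::saveHDF5SummarizedExperiment appends to the prefix.

```r
# Hypothetical helper: copy all files starting with `prefix` from `from_dir`
# to `to_dir`, keeping their original filenames.
relocate_db <- function(from_dir, to_dir, prefix) {
    files <- list.files(from_dir, full.names = TRUE)
    files <- files[startsWith(basename(files), prefix)]
    stopifnot(length(files) > 0)
    dir.create(to_dir, recursive = TRUE, showWarnings = FALSE)
    ok <- file.copy(files, to_dir, overwrite = TRUE)
    stopifnot(all(ok))
    invisible(file.path(to_dir, basename(files)))
}

# Demonstration with empty placeholder files in temporary directories.
src <- file.path(tempdir(), "src_cache")
dst <- file.path(tempdir(), "dst_cache")
dir.create(src, showWarnings = FALSE)
prefix <- "mouse_v212_NDF_c100_DEMO"
file.create(file.path(src, paste0(prefix, c("assays.h5", "se.rds"))))

moved <- relocate_db(src, dst, prefix)
basename(moved)
```

After a real relocation, the database would be loaded by passing the new directory as `dir` and the unchanged prefix as `prefix` to HDF5Array::loadHDF5SummarizedExperiment().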

When GetorthosContrastDB() is called for a specific organism the corresponding HDF5SummarizedExperiment is downloaded from ExperimentHub and cached in the user ExperimentHub directory (see ExperimentHub::getExperimentHubOption("CACHE")).

# Check the path to the user's ExperimentHub directory:
ExperimentHub::getExperimentHubOption("CACHE")

# Download and cache the contrast database for a specific organism.
# Note: mode = "DEMO" caches a small "toy" database for demonstrating queries.
# To download the full database use mode = "ANALYSIS" (this takes considerable time and disk space).
GetorthosContrastDB(organism = "Mouse", mode = "DEMO")

# Load the HDF5SummarizedExperiment:
se <- HDF5Array::loadHDF5SummarizedExperiment(
    dir = ExperimentHub::getExperimentHubOption("CACHE"),
    prefix = "mouse_v212_NDF_c100_DEMO"
)
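
Once loaded, the object can be explored with the standard SummarizedExperiment accessors. The sketch below uses a small in-memory stand-in rather than the real database, so the assay name and annotation columns are illustrative assumptions and do not reflect the actual contents of the orthosData contrast DB; the same accessor calls should apply to the `se` object loaded above.

```r
library(SummarizedExperiment)

set.seed(1)

# Toy stand-in: 4 genes (rows) x 3 contrasts (columns).
# Assay and column names are hypothetical.
toy <- SummarizedExperiment(
    assays  = list(toy_log2FC = matrix(rnorm(12), nrow = 4)),
    rowData = DataFrame(gene_id = paste0("g", 1:4)),
    colData = DataFrame(contrast_id = paste0("c", 1:3))
)

dim(toy)                # genes x contrasts
assayNames(toy)         # names of the available assays
head(rowData(toy))      # gene annotation
head(colData(toy))      # contrast annotation

# Extract a small slice of one assay as an ordinary matrix.
assay(toy, "toy_log2FC")[1:2, 1:2]
```

For the HDF5-backed database the assays are delayed, on-disk arrays, so subsetting before converting to an in-memory matrix keeps memory use low.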

Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] orthosData_1.4.0 knitr_1.48       BiocStyle_2.34.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.36.0 KEGGREST_1.46.0            
#>  [3] xfun_0.48                   bslib_0.8.0                
#>  [5] rhdf5_2.50.0                lattice_0.22-6             
#>  [7] Biobase_2.66.0              rhdf5filters_1.18.0        
#>  [9] vctrs_0.6.5                 tools_4.4.1                
#> [11] generics_0.1.3              stats4_4.4.1               
#> [13] curl_5.2.3                  tibble_3.2.1               
#> [15] fansi_1.0.6                 AnnotationDbi_1.68.0       
#> [17] RSQLite_2.3.7               blob_1.2.4                 
#> [19] pkgconfig_2.0.3             Matrix_1.7-1               
#> [21] dbplyr_2.5.0                S4Vectors_0.44.0           
#> [23] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
#> [25] stringr_1.5.1               compiler_4.4.1             
#> [27] Biostrings_2.74.0           fontawesome_0.5.2          
#> [29] GenomeInfoDb_1.42.0         htmltools_0.5.8.1          
#> [31] sass_0.4.9                  yaml_2.3.10                
#> [33] pillar_1.9.0                crayon_1.5.3               
#> [35] jquerylib_0.1.4             DelayedArray_0.32.0        
#> [37] cachem_1.1.0                abind_1.4-8                
#> [39] ExperimentHub_2.14.0        AnnotationHub_3.14.0       
#> [41] tidyselect_1.2.1            digest_0.6.37              
#> [43] stringi_1.8.4               dplyr_1.1.4                
#> [45] bookdown_0.41               BiocVersion_3.20.0         
#> [47] fastmap_1.2.0               grid_4.4.1                 
#> [49] SparseArray_1.6.0           cli_3.6.3                  
#> [51] magrittr_2.0.3              S4Arrays_1.6.0             
#> [53] utf8_1.2.4                  filelock_1.0.3             
#> [55] UCSC.utils_1.2.0            rappdirs_0.3.3             
#> [57] bit64_4.5.2                 rmarkdown_2.28             
#> [59] XVector_0.46.0              httr_1.4.7                 
#> [61] matrixStats_1.4.1           bit_4.5.0                  
#> [63] png_0.1-8                   memoise_2.0.1              
#> [65] HDF5Array_1.34.0            evaluate_1.0.1             
#> [67] GenomicRanges_1.58.0        IRanges_2.40.0             
#> [69] BiocFileCache_2.14.0        rlang_1.1.4                
#> [71] glue_1.8.0                  DBI_1.2.3                  
#> [73] BiocManager_1.30.25         BiocGenerics_0.52.0        
#> [75] jsonlite_1.8.9              Rhdf5lib_1.28.0            
#> [77] R6_2.5.1                    MatrixGenerics_1.18.0      
#> [79] zlibbioc_1.52.0

5 References

  1. Lachmann, Alexander, et al. “Massive mining of publicly available RNA-seq data from human and mouse.” Nature Communications 9.1 (2018): 1366.