Cross Species Investigation and Analysis (CoSIA
) is a package that provides researchers with an alternative methodology for comparing across species and tissues using normal wild-type RNA-Seq Gene Expression data from Bgee. Using RNA-Seq Gene Expression data, CoSIA provides multiple visualization tools to explore the transcriptome diversity and variation across genes, tissues, and species. CoSIA uses the Coefficient of Variation and Shannon Entropy and Specificity to calculate transcriptome diversity and variation. CoSIA also provides additional conversion tools and utilities to provide a streamlined methodology for cross-species comparison.
CoSIA is split into 3 methods that provide various resources in order for researchers to conduct cross species analysis using gene expression metrics.
Method 1 uses getConversion
to convert inputs between different gene identifiers in the same species as well as orthologs in different species. The other modules access tissue- and/or species-specific gene expression.
Method 2 uses getGEx
to obtain raw read counts that undergo Variance Stabilizing Transformation via DESeq2 methodology. Gene expression values are visualized for a single gene across multiple tissues in single model organism or across multiple species in a single tissue using the plotting methods, plotTissueGEx
& plotSpeciesGEx
, respectively.
Method 3 uses getGExMetrics
to calculate median-based Coefficient of Variation (variability) and Shannon Entropy (diversity & specificity). There are two accompanying plotting methods, plotCVGEx
& plotDSGEx
that are used to visualize the variation and diversity & specificity (DS) of gene expression across genes, tissues, and species.
In R:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CoSIA")
library(CoSIA)
load("../inst/extdata/proccessed/monogenic_kidney_genes.rda")
# downsampling data for figure
set.seed(42)
monogenic_kid_sample <- dplyr::sample_n(monogenic_kidney_genes, 20)
Slot Name | Possible Value Options | Default |
---|---|---|
gene_set | “character”, c(characters), data.frame$column | N/A |
i_species; o_species; map_species | “h_sapiens”, “m_musculus”, “r_noregicus”, “d_rerio”, “d_melangoaster”, “c_elegans” | N/A |
mapping_tool | “annotationDBI”, “biomaRt” | annotationDBI |
input_id; output_ids | “Ensembl_id”, “Entrez_id”, “Symbol” | N/A |
ortholog_database | “NCBIOrtho”, “HomoloGene” | HomoloGene |
map_tissue | c(“tissue”), “tissue”; see getTissues |
N/A |
metric_type | “CV_Tissue”, “CV_Species”, “DS_Gene”, “DS_Gene_all”, “DS_Tissue”, “DS_Tissue_all” | N/A |
getTissues
The function getTissues
retrieves tissues available for a single species:
CoSIA::getTissues("d_rerio")
…or tissues shared across a list of species:
CoSIA::getTissues(c("m_musculus", "r_norvegicus"))
#> # A tibble: 21 × 1
#> Common_Anatomical_Entity_Name
#> <chr>
#> 1 adult mammalian kidney
#> 2 brain
#> 3 cerebellum
#> 4 colon
#> 5 duodenum
#> 6 esophagus
#> 7 frontal cortex
#> 8 heart
#> 9 ileum
#> 10 jejunum
#> # ℹ 11 more rows
NOTE: To compare across all shared tissues for your selected species, you can assign the getTissues
output to an object as input for map_tissues
when initializing a CoSIAn object.
CoSIAn_Obj <- CoSIA::CoSIAn(
gene_set = unique(monogenic_kid_sample$Gene),
i_species = "h_sapiens",
o_species = c(
"h_sapiens",
"r_norvegicus"
),
input_id = "Symbol",
output_ids = "Ensembl_id",
map_species = c(
"h_sapiens",
"r_norvegicus"
),
map_tissues = c(
"adult mammalian kidney",
"heart"
),
mapping_tool = "annotationDBI",
ortholog_database = "HomoloGene",
metric_type = "CV_Species"
)
str(CoSIAn_Obj)
#> Formal class 'CoSIAn' [package "CoSIA"] with 13 slots
#> ..@ gene_set : chr [1:20] "SDCCAG8" "TACO1" "FANCA" "BMP7" ...
#> ..@ i_species : chr "h_sapiens"
#> ..@ input_id : chr "Symbol"
#> ..@ o_species : chr [1:2] "h_sapiens" "r_norvegicus"
#> ..@ output_ids : chr "Ensembl_id"
#> ..@ mapping_tool : chr "annotationDBI"
#> ..@ ortholog_database: chr "HomoloGene"
#> ..@ converted_id :'data.frame': 1 obs. of 1 variable:
#> .. ..$ X0: num 0
#> ..@ map_tissues : chr [1:2] "adult mammalian kidney" "heart"
#> ..@ map_species : chr [1:2] "h_sapiens" "r_norvegicus"
#> ..@ gex :'data.frame': 1 obs. of 1 variable:
#> .. ..$ X0: num 0
#> ..@ metric_type : chr "CV_Species"
#> ..@ metric :'data.frame': 1 obs. of 1 variable:
#> .. ..$ X0: num 0
NOTE: Any species you plan to compare must be specified in map_species
AND o_species
.
NOTE: The getGEx
function requires that “Ensembl_id” is included as an output_id
. Here, we convert gene symbols into Ensembl IDs.
The following use cases provide running examples of CoSIA applications with Natera’s Monogenic Kidney Disease Panel. We will perform id conversion, obtain and visualize gene expression data, and calculate and visualize CV and DS of gene expression across three species (human, mouse, & rat) and two tissues (kidney & heart).
getConversion
)CoSIA can convert input ids to any of the types listed in Table @ref(tab:Table1.).
CoSIAn_Obj_convert <- CoSIA::getConversion(CoSIAn_Obj)
head(CoSIA::viewCoSIAn(CoSIAn_Obj_convert, "converted_id"))
#> [[1]]
#> [1] "SDCCAG8" "SDCCAG8" "TACO1" "FANCA" "BMP7" "AP2S1" "SCARB2"
#> [8] "AQP2" "XDH" "PCBD1" "AMN" "PPP3CA" "LMX1B" "NPHS1"
#> [15] "EBP" "SLX4" "CYP11B2" "TTR" "BICC1" "OFD1" "UCP3"
Now we will use the converted IDs with getGEx
to obtain heart and kidney gene expression data for human, mouse and rat curated from Bgee.
CoSIAn_Obj_gex <- CoSIA::getGEx(CoSIAn_Obj_convert)
head(CoSIA::viewCoSIAn(CoSIAn_Obj_gex, "gex"), 5)
These data can be visualized with plotSpeciesGEx
to plot expression of a single gene in a single tissue across species or plotTissueGEx
to plot expression of a single gene in a single species across tissues.
Here we are plotting gene expression for the TACO1 gene in kidney tissue for human, mouse, and rat.
CoSIAn_Obj_gexplot <- CoSIA::plotSpeciesGEx(CoSIAn_Obj_gex, "adult mammalian kidney", "ENSG00000136463")
CoSIAn_Obj_gexplot
CoSIA produces an interactive plot of expression values for TACO1 in kidney for human mouse and rat. Hovering over the plot displays sample-specific VST read count values. For file size compliance, we have included a static plot for this output. NOTE: Expression values are not meant to be compared across species in this plot. The next two use cases demonstrate appropriate methods for comparing gene expression patterns across species.
Calculating and visualizing median-based coefficient of variation allows for a relative comparison of gene expression variability between species. In CoSIA, CV is calculated as the standard deviation over the median using VST read count values. In the section 1.2.4, we set metric_type= "CV_Species"
to calculate CV for the monogenic kidney gene set across human and rat.
CoSIAn_Obj_CV <- CoSIA::getGExMetrics(CoSIAn_Obj_convert)
CoSIAn_Obj_CVplot <- CoSIA::plotCVGEx(CoSIAn_Obj_CV)
CoSIAn_Obj_CVplot
Here we see relatively low variability in gene expression of most genes in rat and human, with the exception of relatively high levels of variation in metric_type= "CV_Tissue"
displays the variability of gene expression for the monogenic gene set across selected, shared tissues.
Diversity and specificity metrics are also suitable for comparing gene expression patterns across species. In CoSIA, diversity and specificity (based on Shannon’s entropy) are calculated using min-max scaled median VST values generated each gene in a tissue- and species-specific manner. Values are used to calculate diversity and specificity as in Martínez & Reyes-Valdés, 2008.
In the final use case, we are calculating and visualizing diversity and specificity for kidney and heart tissue across the monogenic kidney gene set by setting metric_type= "DS_Tissue"
. There are additional metric_types for the CoSIAn objects, view @ref(tab:Table2.) This example is not run, but the code and the resulting static output is included.
CoSIAn_Obj_DS <- CoSIA::CoSIAn(
gene_set = unique(monogenic_kid_sample$Gene),
i_species = "h_sapiens",
o_species = c("h_sapiens", "r_norvegicus"),
input_id = "Symbol",
output_ids = "Ensembl_id",
map_species = c("h_sapiens", "r_norvegicus"),
map_tissues = c("adult mammalian kidney", "heart"),
mapping_tool = "annotationDBI",
ortholog_database = "HomoloGene",
metric_type = "DS_Tissue"
)
CoSIAn_Obj_DS <- CoSIA::getConversion(CoSIAn_Obj_DS)
CoSIAn_Obj_DS <- CoSIA::getGExMetrics(CoSIAn_Obj_DS)
CoSIAn_Obj_DSplot <- CoSIA::plotDSGEx(CoSIAn_Obj_DS)
CoSIAn_Obj_DSplot
There is low specificity in kidney tissue, indicating there are more genes from the set that are expressed in kidney. We see higher specificity in heart tissue, indicating that there are fewer genes from the set that are expressed in heart.
Metric Type | Function |
---|---|
DS_Gene | calculates diversity and specificity across genes in gene_set in the tissues listed in map_tissue |
DS_Gene_all | calculates diversity and specificity across genes in gene_set in the all the available tissues in a species |
DS_Tissue | calculates diversity and specificity across tissues listed in map_tissue in the genes in gene_set |
DS_Tissue_all | calculates diversity and specificity across tissues listed in map_tissue in the all the genes in a species |
Session info
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CoSIAdata_1.5.2 CoSIA_1.6.0 ExperimentHub_2.14.0
#> [4] AnnotationHub_3.14.0 BiocFileCache_2.14.0 dbplyr_2.5.0
#> [7] BiocGenerics_0.52.0 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.46.0 gtable_0.3.6
#> [3] ggplot2_3.5.1 xfun_0.48
#> [5] bslib_0.8.0 Biobase_2.66.0
#> [7] vctrs_0.6.5 tools_4.4.1
#> [9] generics_0.1.3 stats4_4.4.1
#> [11] curl_5.2.3 tibble_3.2.1
#> [13] fansi_1.0.6 AnnotationDbi_1.68.0
#> [15] RSQLite_2.3.7 blob_1.2.4
#> [17] pkgconfig_2.0.3 homologene_1.4.68.19.3.27
#> [19] S4Vectors_0.44.0 lifecycle_1.0.4
#> [21] GenomeInfoDbData_1.2.13 compiler_4.4.1
#> [23] Biostrings_2.74.0 munsell_0.5.1
#> [25] GenomeInfoDb_1.42.0 htmltools_0.5.8.1
#> [27] sass_0.4.9 yaml_2.3.10
#> [29] pillar_1.9.0 crayon_1.5.3
#> [31] jquerylib_0.1.4 cachem_1.1.0
#> [33] org.Hs.eg.db_3.20.0 mime_0.12
#> [35] tidyselect_1.2.1 digest_0.6.37
#> [37] dplyr_1.1.4 purrr_1.0.2
#> [39] bookdown_0.41 BiocVersion_3.20.0
#> [41] grid_4.4.1 fastmap_1.2.0
#> [43] colorspace_2.1-1 cli_3.6.3
#> [45] org.Rn.eg.db_3.20.0 magrittr_2.0.3
#> [47] utf8_1.2.4 withr_3.0.2
#> [49] scales_1.3.0 filelock_1.0.3
#> [51] UCSC.utils_1.2.0 rappdirs_0.3.3
#> [53] bit64_4.5.2 rmarkdown_2.28
#> [55] XVector_0.46.0 httr_1.4.7
#> [57] bit_4.5.0 png_0.1-8
#> [59] memoise_2.0.1 evaluate_1.0.1
#> [61] knitr_1.48 annotationTools_1.80.0
#> [63] IRanges_2.40.0 rlang_1.1.4
#> [65] glue_1.8.0 DBI_1.2.3
#> [67] BiocManager_1.30.25 jsonlite_1.8.9
#> [69] R6_2.5.1 zlibbioc_1.52.0