An automated discovery tool for discovering hidden biological and technical links
knowYourCG is a tool for evaluating CpG feature enrichment using Illumina probe IDs. Tthis tool automates the hypothesis testing by asking whether a set of CpGs (indexed by Illumina methylation chip IDs, hence a sparse representation of the methylome) is enriched in certain categories or features. These categories or features can be categorical (e.g., CpGs located at specific tissue-specific transcription factors) or continuous (e.g., the local CpG density of CpGs). Additionally, the set of CpGs to which the test will be applied can be categorical or continuous as well.
The set of CpGs that will be tested for enrichment is called the query set, and the set of CpGs that will be used to determine enrichment is called the database set. A query set, for example, might be the results of a differential methylation analysis or from an epigenome-wide association study. We have taken the time to curate our own database sets from a variety of sources that describe different categorical and continuous features such as technical characterization of the probes, CpGs associated with certain chromatin states, gene association, transcription factor binding sites, CpG density, etc.
Additionally, knowYourCG has support for feature selection and feature engineering, which is currently in development.
The following commands prepares the use of KnowYourCG:
Our example uses a specific mouse design group as input (PGCMeth, methylated in primoridal germ cells). First get the CG list using the following code:
## [1] "cg36615889_TC11" "cg36646136_BC21" "cg36647910_BC11" "cg36857173_TC21"
## [5] "cg36877289_BC21" "cg36899653_BC21"
Now test the enrichment over database groups. By default, KYCG will select all the categorical groups and overlapping genes (CpGs associated with a gene).
We can visualize the result of this test using the
KYCG_plotEnrichAll
function:
## Loading required namespace: ggrepel
This plot groups different database sets along the x-axis and plot
-log10(FDR) on the y-axis. As expected, the PGCMeth group itself appear
on the top of the list. But one can also find histone H3K9me3, chromHMM
Het
and transcription factor Trim28
binding
enriched in this CG group.
There are four testing scenarios depending on the type format of the
query set and database sets. They are shown with the respective testing
scenario in the table below. testEnrichment
,
testEnrichmentSEA
are for Fisher’s exact test and Set
Enrichment Analysis respectively.
Continuous.DB | Discrete.DB | |
---|---|---|
Continuous Query | Correlation-based | Set Enrichment Analysis |
Discrete Query | Set Enrichment Analysis | Fisher’s Exact Test |
The main work horse function for test enrichment of a categorical
query against categorical databases is the testEnrichment
function. This function calculates the extent of overlap and apply
different statistics for enrichment testing. The
testEnrichment()
will perform Fisher’s exact test
(one-tailed by default, but two-tailed optionally) and report metrics
about each of the the loaded database sets.
Choice of universal set: Universal set is the set of all probes for a given platform. It can either be passed in as an argument called
universeSet
or the platform name can be passed with argumentplatform
. If neither of these are supplied, the universe set will be implied from the probes.
library(SummarizedExperiment)
## prepare a query
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS", platform="MM285")
results %>% dplyr::filter(overlap>10) %>% head
## prepare another query
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS", platform="MM285")
results %>% dplyr::filter(overlap>10) %>%
dplyr::select(dbname, estimate, test, FDR) %>% head
The output of each test contains at least four variables: the estimate (fold enrichment, not the test statistics), p-value, type of test, and whether meta data is included in the tested database set (hasMeta), as well as the name of the database set and the database group. By default, the estimate column is sorted.
It should be noted that the estimate (or test statistic) is test dependent and comparison between p-values should be limited to within the same type of test. For instance, the test statistics for Fisher’s exact test and SEA are log fold change and the test statistic for Spearman’s test is simply the rank order correlation coefficient. For simplicity, we report all of the test types in one data frame.
The nQ
and nD
columns identify the length
of the query set and the database set, respectively. Often, it’s
important to examine the extent of overlap between the two sets, so that
metric is reported as well in the overlap
column.
See Supplemental Vignette for other ways of visualizing enrichment results.
The success of enrichment testing depends critically on the availability of biologically-relevant databases. To reflect the biological meaning of databases and facilitate selective testing, we have organized our database sets into different groups. Each group contains one or multiple databases. Here is how to find the names of pre-built database groups:
The KYCG_listDBGroups()
function returns a data frame
containing information of these databases. The Title column is the
accession key one needs for the testEnrichment
function.
With the accessions, one can either directly use them in the
testEnrichment
function or explicitly call the
KYCG_getDBs()
function to retrieve databases themselves.
Caching these databases on the local machine is important, for two
reasons: it limits the number of requests sent to the Bioconductor
server, and secondly it limits the amount of time the user needs to wait
when re-downloading database sets. For this reason, one should run
sesameDataCache()
before loading in any database sets. This
will take some time to download all of the database sets but this only
needs to be done once per installation. During the analysis the database
sets can be identified using these accessions. Sesame also does some
guessing when a unique substring is given. For example, the following
“MM285.designGroup” retrieves the “KYCG.MM285.designGroup.20210210”
database. Let’s look at the database group which we had used as the
query (query and database are reciprocal) in our first example:
## Selected the following database groups:
## 1. KYCG.MM285.designGroup.20210210
In total, 32 datasets have been loaded for this group. We can get the “PGCMeth” as an element of the list:
## chr [1:474] "cg36615889_TC11" "cg36646136_BC21" "cg36647910_BC11" ...
## - attr(*, "group")= chr "KYCG.MM285.designGroup.20210210"
## - attr(*, "dbname")= chr "PGCMeth"
On subsequent runs of the KYCG_getDBs()
function, the
database loading can be faster thanks to the sesameData in-memory caching, if the
corresponding database has been loaded.
A query set represents probes of interest. It may either be in the form of a character vector where the values correspond to probe IDs or a named numeric vector where the names correspond to probe IDs. The query and database definition is rather arbitrary. One can regard a database as a query and turn a query into a database, like in our first example. In real world scenario, query can come from differential methylation testing, unsupervised clustering, correlation with a phenotypic trait, and many others. For example, we could consider CpGs that show tissue-specific methylation as the query. We are getting the B-cell-specific hypomethylation.
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "B_cell"]
head(query)
## [1] "cg32668003_TC11" "cg45118317_TC11" "cg37563895_TC11" "cg46105105_BC11"
## [5] "cg47206675_TC21" "cg38855216_TC21"
This query set represents hypomethylated probes in Mouse B-cells from the MM285 platform. This specific query set has 168 probes.
A special case of set enrichment is to test whether CpGs are
associated with specific genes. Automating the enrichment test process
only works when the number of database sets is small. This is important
when targeting all genes as there are tens of thousands of genes on each
platform. By testing only those genes that overlap with the query set,
we can greatly reduce the number of tests. For this reason, the gene
enrichment analysis is a special case of these enrichment tests. We can
perform this analysis using the KYCG_buildGeneDBs()
function.
query <- names(sesameData_getProbesByGene("Dnmt3a", "MM285"))
results <- testEnrichment(query,
KYCG_buildGeneDBs(query, max_distance=100000, platform="MM285"),
platform="MM285")
results[,c("dbname","estimate","gene_name","FDR", "nQ", "nD", "overlap")]
Using these sample results, we can plot a volcano plot and lollipop plot.
As expected, we recover our targeted gene (Dnmt3a).
Some commonly used database sets are stored in ExperimentHub/sesameData package. But more database sets can be found here. You can use a convenience function to download these database sets to a local folder.
KYCG_listDBGroups(path = "~/Downloads/KYCG_knowledgebase_EPICv2")
## [1] "ABCompartment.20220911.gz" "Blacklist.20220304.gz"
## [3] "CGI.20220904.gz" "ChromHMM.20220303.gz"
## [5] "CTCFbind.20220911.gz" "HM.20221013.gz"
## [7] "ImprintingDMR.20220818.gz" "MetagenePC.20220911.gz"
## [9] "nFlankCG.20220321.gz" "PMD.20220911.gz"
## [11] "ProbeType.gz" "REMCChromHMM.20220911.gz"
## [13] "rmsk1.20220307.gz" "rmsk2.20220321.gz"
## [15] "Tetranuc2.20220321.gz" "TFBS.20220921.gz"
## [17] "TFBSrm.20221005.gz"
## load all database files in the folder
dbs <- KYCG_loadDBs("~/Downloads/KYCG_knowledgebase_EPICv2/")
## or one database file
dbs <- KYCG_loadDBs("~/Downloads/KYCG_knowledgebase_EPICv2/hg38/CGI.20220904.gz")
One can get all the genes associated with a probe set by
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
regs <- sesameData_getTxnGRanges("mm10", merge2gene = TRUE)
genes <- sesameData_annoProbes(query, regs, platform="MM285", return_ov_features=TRUE)
genes
## GRanges object with 156 ranges and 2 metadata columns:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000033014.12 chr3 103279293-103358775 + | Trim33
## ENSMUSG00000041124.11 chr9 4376562-4386870 + | Msantd4
## ENSMUSG00000071253.8 chr10 62920633-62946498 + | Slc25a16
## ENSMUSG00000016382.15 chrX 75785654-75875182 - | Pls3
## ENSMUSG00000078974.10 chr11 16500530-16508484 - | Sec61g
## ... ... ... ... . ...
## ENSMUSG00000009090.17 chr11 4986824-5042791 + | Ap1b1
## ENSMUSG00000029624.10 chr5 145147514-145167108 - | Ptcd1
## ENSMUSG00000068877.12 chr3 94693556-94704413 + | Selenbp2
## ENSMUSG00000030629.15 chr7 84613766-84689959 - | Zfand6
## ENSMUSG00000056153.15 chr18 88665224-88927481 - | Socs6
## gene_type
## <character>
## ENSMUSG00000033014.12 protein_coding
## ENSMUSG00000041124.11 protein_coding
## ENSMUSG00000071253.8 protein_coding
## ENSMUSG00000016382.15 protein_coding
## ENSMUSG00000078974.10 protein_coding
## ... ...
## ENSMUSG00000009090.17 protein_coding
## ENSMUSG00000029624.10 protein_coding
## ENSMUSG00000068877.12 protein_coding
## ENSMUSG00000030629.15 protein_coding
## ENSMUSG00000056153.15 protein_coding
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
Here we demonstrate the use of g:Profiler2 to perform Gene ontology enrichment analysis:
library(gprofiler2)
## use gene name
gostres <- gost(genes$gene_name, organism = "mmusculus")
gostres$result[order(gostres$result$p_value),]
gostplot(gostres)
## use Ensembl gene ID, note we need to remove the version suffix
gene_ids <- sapply(strsplit(names(genes),"\\."), function(x) x[1])
gostres <- gost(gene_ids, organism = "mmusculus")
gostres$result[order(gostres$result$p_value),]
gostplot(gostres)
The query may be a named continuous vector. In that case, either a gene enrichment score will be calculated (if the database is discrete) or a Spearman correlation will be calculated (if the database is continuous as well). The three other cases are shown below using biologically relevant examples.
To display this functionality, let’s load two numeric database sets individually. One is a database set for CpG density and the other is a database set corresponding to the distance of the nearest transcriptional start site (TSS) to each probe.
res <- testEnrichmentSEA(query, "MM285.seqContextN")
res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")]
The estimate here is enrichment score.
NOTE: Negative enrichment score suggests enrichment of the categorical database with the higher values (in the numerical database). Positive enrichment score represent enrichment with the smaller values. As expected, the designed TSS CpGs are significantly enriched in smaller TSS distance and higher CpG density.
One can plot the set enrichment analysis result by
prepPlot=TRUE
command followed by calling the
KYCG_plotSetEnrichment()
function.
query <- KYCG_getDBs("KYCG.MM285.designGroup")[["TSS"]]
db <- KYCG_getDBs("MM285.seqContextN", "distToTSS")
res <- testEnrichmentSEA(query, db, prepPlot = TRUE)
KYCG_plotSetEnrichment(res[[1]])
Alternatively one can test the enrichment of a continuous query with discrete databases. Here we will use the methylation level from a sample as the query and test it against the chromHMM chromatin states.
beta_values <- getBetas(sesameDataGet("MM285.1.SigDF"))
res <- testEnrichmentSEA(beta_values, "MM285.chromHMM")
res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")]
As expected, chromatin states Tss
, Enh
has
negative enrichment score, meaning these databases are associated with
small values of the query (DNA methylation level). On the contrary,
Quies
states are associated with high methylation
level.
Methylation Correlation Network Analysis is motivated by gene expression methods that seek to identify modules and networks from expression data. Genes that are co-expressed behave similarly over different environments and thus may share similar biological function and participate in functional networks. Similarly, CpGs that have highly correlated methylation fractions across different environments may be involved in common epigenetic and biological pathways.
Two highly correlated CpGs belonging to the same module. Across 256 samples from different tissues, ages and sexes, methylation changes at one CpG is accompanied by similar changes in a separate co-methylated CpG
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.5.1 tibble_3.2.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] IRanges_2.41.1 S4Vectors_0.45.2
## [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [11] knitr_1.49 sesame_1.25.1
## [13] sesameData_1.25.0 ExperimentHub_2.15.0
## [15] AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [17] dbplyr_2.5.0 BiocGenerics_0.53.3
## [19] generics_0.1.3
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4
## [4] blob_1.2.4 filelock_1.0.3 Biostrings_2.75.1
## [7] fastmap_1.2.0 digest_0.6.37 lifecycle_1.0.4
## [10] KEGGREST_1.47.0 RSQLite_2.3.8 magrittr_2.0.3
## [13] compiler_4.5.0 rlang_1.1.4 sass_0.4.9
## [16] tools_4.5.0 utf8_1.2.4 yaml_2.3.10
## [19] labeling_0.4.3 S4Arrays_1.7.1 bit_4.5.0
## [22] curl_6.0.1 DelayedArray_0.33.2 plyr_1.8.9
## [25] RColorBrewer_1.1-3 abind_1.4-8 BiocParallel_1.41.0
## [28] withr_3.0.2 purrr_1.0.2 grid_4.5.0
## [31] preprocessCore_1.69.0 fansi_1.0.6 wheatmap_0.2.0
## [34] colorspace_2.1-1 scales_1.3.0 cli_3.6.3
## [37] rmarkdown_2.29 crayon_1.5.3 reshape2_1.4.4
## [40] httr_1.4.7 tzdb_0.4.0 DBI_1.2.3
## [43] cachem_1.1.0 stringr_1.5.1 zlibbioc_1.53.0
## [46] parallel_4.5.0 AnnotationDbi_1.69.0 BiocManager_1.30.25
## [49] XVector_0.47.0 vctrs_0.6.5 Matrix_1.7-1
## [52] jsonlite_1.8.9 hms_1.1.3 ggrepel_0.9.6
## [55] bit64_4.5.2 jquerylib_0.1.4 glue_1.8.0
## [58] codetools_0.2-20 stringi_1.8.4 gtable_0.3.6
## [61] BiocVersion_3.21.1 UCSC.utils_1.3.0 munsell_0.5.1
## [64] pillar_1.9.0 rappdirs_0.3.3 htmltools_0.5.8.1
## [67] GenomeInfoDbData_1.2.13 R6_2.5.1 evaluate_1.0.1
## [70] lattice_0.22-6 readr_2.1.5 png_0.1-8
## [73] memoise_2.0.1 bslib_0.8.0 Rcpp_1.0.13-1
## [76] SparseArray_1.7.2 xfun_0.49 pkgconfig_2.0.3