The R source program dfHclust.R
is in the github repository for lab 7. Source it into your R session, and then verify that it works with the call
data(mtcars)
dfHclust(mtcars, labels=rownames(mtcars))
If it fails, add any missing libraries, and do what it takes to get it to work. Interrupt the shiny session to proceed.
Set up inputs to dfHclust
using the tissuesGeneExpression
data.
library(tissuesGeneExpression)
data(tissuesGeneExpression)
df = data.frame(t(e))
no = which(tab$SubType == "normal")
df = df[no,]
tisslabel = tab$Tissue[no]
Use dfHclust(df[,1:50], tisslabel)
as a check. Interrupt
Map the column names of df
to gene symbols. Use hgu133a.db
. Remove columns with unmappable symbols and rename the remaining columns with the symbols.
library(hgu133a.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: org.Hs.eg.db
##
##
nids = mapIds(hgu133a.db, keys=
sub("^X", "", colnames(df)), keytype="PROBEID", column="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
bad = which(is.na(nids))
if (length(bad)>0) {
df = df[,-bad]
nids = nids[-bad]
colnames(df) = nids
}
dim(df)
## [1] 85 21112
Interrupt the shiny session and use the new df
as input. Note that the clustering is based on three genes by default. Other default choices for the clustering are - the object:object distance used - the agglomeration algorithm - the height at which the tree is cut to define clusters
Shift the view to the silhouette
plot. With the default settings, the average silhouette value for five clusters is 0.35.
Increase the height for cut
value to 8. How many clusters are declared, and what is the average silhouette value?
Add the gene CCL5 to the feature set used for clustering. Now how many clusters are declared?
Interrupt the shiny session to proceed.
Modify df
so that the column names are in alphabetical order. Use dfHclust(df[,1:50], tisslabel)
for the new ordering. What is the average silhouette value for the default choices of dfHclust
settings?
Change the clustering method to ward.D2
. What is the new average silhouette value?
Interrupt the shiny session to proceed.
We have used an arbitrary selection of genes for these illustrations. Consider the idea that steady-state expression pattern of genes that are used to perform splicing is important for tissue differentiation. We can get a list of relevant genes on the hgu133a array as follows.
# using GO.db
GOID TERM
22097 GO:0045292 mRNA cis splicing, via spliceosome
splg = select(hgu133a.db, keys="GO:0045292", keytype="GO", columns="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
tokeep = intersect(splg$SYMBOL, colnames(df))
dfsp = df[, tokeep]
You should have 7 genes available after these operations. Use dfsp
with dfHclust, and select all genes for clustering.
As you add spliceosome-annotated genes into the clustering, does the appearance of the clustering tree improve?
Install and attach the drosmap
package. This is a simple repackaging of code and data provided at BDGP.
library(BiocInstaller)
biocLite("vjcitn/drosmap")
library(drosmap)
A data.frame of spatially recorded gene expression patterns derived from blastocyst samples is available. We’ll display some examples.
library(drosmap)
## Loading required package: NMF
## Loading required package: pkgmaker
## Loading required package: registry
##
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:S4Vectors':
##
## new2
## The following object is masked from 'package:base':
##
## isNamespaceLoaded
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 3/4
## To enable shared memory capabilities, try: install.extras('
## NMF
## ')
##
## Attaching package: 'NMF'
## The following objects are masked from 'package:S4Vectors':
##
## compare, nrun
## Loading required package: knitcitations
## Loading required package: grid
## Loading required package: png
data(expressionPatterns)
data(template)
imageBatchDisplay(expressionPatterns[,1:12],
nrow=3, ncol=4, template=template[,-1])
We’ll reduce the data matrix (for convenience) to 701 unique genes
data(uniqueGenes)
uex = expressionPatterns[,uniqueGenes]
We’ll begin with a factorization using a basis of rank 10.
set.seed(123)
library(NMF)
m10 =nmf(uex, rank=10)
m10
## <Object of class: NMFfit>
## # Model:
## <Object of class:NMFstd>
## features: 405
## basis/rank: 10
## samples: 701
## # Details:
## algorithm: brunet
## seed: random
## RNG: 403L, 624L, ..., 2099891502L [e38d032700af470a3a1013304e0fcab6]
## distance metric: 'KL'
## residuals: 4901.298
## Iterations: 2000
## Timing:
## user system elapsed
## 125.648 0.024 125.697
The authors of the Wu et al. 2016 PNAS paper justify a rank 21 basis.
set.seed(123)
library(NMF)
m21 =nmf(uex, rank=21)
To visualize the clustering of the expression patterns with the rank 10 basis, use
imageBatchDisplay(basis(m10), nrow=4,ncol=3,template=template[,-1])
The ‘predicted’ matrix with the rank 10 basis is
PM10 = basis(m10)%*%coef(m10)
Compare the faithfulness of the rank 10 and rank 21 approximations.
Produce the display of the m21
basis with imageBatchDisplay
and check that the constituents are similar to those shown as principal patterns below (from the Wu et al. paper).
Can any of the patterns found with the rank 10 basis be mapped to key anatomical components of the blastocyst fate schematic?