Today we’ll cover statistical concepts and tests commonly used in cancer research. The dataset we’ll access is a subset of the ALL expression data whose patient information we worked with in the first day’s material. In addition to that information we’ll access 1000 associated expression microarray features that present the highest variance across the patient samples. The data have been saved in a binary format to reduce file sizes.
The following continues to explore the ‘ALL’ phenotypic data from this earlier in the course. Start by loading the data, as before.
path <- file.choose() # look for ALLphenoData.tsv
stopifnot(file.exists(path))
pdata <- read.csv(path)
Let’s look a little more closely at patient information in the pdata
object:
median(pdata$age)
## [1] NA
The value NA
shows up because some of the pdata$age
values are NA. We can verify this by asking R if there are any NA values
any(is.na(pdata$age))
## [1] TRUE
anyNA(pdata$age) # same, but more efficient
## [1] TRUE
Consulting the help page for ?median
suggests a solution – specify the argument na.rm=TRUE
. Explore other aspects of age, like range()
and quantile()
.
median(pdata$age, na.rm=TRUE)
## [1] 29
range(pdata$age, na.rm=TRUE)
## [1] 5 58
quantile(pdata$age, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 5.0 19.0 29.0 45.5 58.0
Some simple plots of patient ages – note the nested functions!
plot(pdata$age)
plot(sort(pdata$age))
sortedAge = sort(pdata$age)
?plot
Exercise: Plot the sortedAge
with markers at each data point and connect the points with red lines. You’ll need to use the graphics parameters type="b"
and col="red"
, see ?plot.default
.
Exercise: Plot one variable (e.g., age
) as a function of another (e.g., sex
). Since sex
is a factor, R chooses to create a box plot; does this make sense?
plot(age ~ sex, pdata)
Histograms, and their display options:
hist(pdata$age)
?hist
hist(pdata$age, br=25)
Cross tables use formulas
to describe the relationship between the data they present:
xtabs(~sex, data=pdata, exclude=NA)
## sex
## F M
## 42 83
xtabs(~sex + remission, data=pdata, exclude=NA)
## remission
## sex CR REF
## F 27 7
## M 71 8
kinet
) males (sex
) are refractory (remission
)?Use plot()
to visualize the distribution of female and male ages in the pdata
data set.
plot(age ~ sex, pdata)
It looks like females are on average older than males. Use t.test()
to find out.
t.test(age ~ sex, pdata)
##
## Welch Two Sample t-test
##
## data: age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.022660 9.504142
## sample estimates:
## mean in group F mean in group M
## 35.16667 30.92593
Check out the help page for t.test()
?t.test
What are all those additional arguments to t.test()
? For example, what is the meaning of the var.equal
argument? Why are there 79.88 degrees of freedom?
t.test(age ~ sex, pdata, var.equal=TRUE)
##
## Two Sample t-test
##
## data: age by sex
## t = 1.6266, df = 121, p-value = 0.1064
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9207109 9.4021924
## sample estimates:
## mean in group F mean in group M
## 35.16667 30.92593
A t-test can also be viewed as an analysis of variance (ANOVA); analysis of variance is a form of linear model. Use lm()
to fit a linear model that describes how age changes with sex; the anova()
function summarizes the linear model in a perhaps more familiar ANOVA table.
(fit <- lm(age ~ sex, pdata))
##
## Call:
## lm(formula = age ~ sex, data = pdata)
##
## Coefficients:
## (Intercept) sexM
## 35.167 -4.241
anova(fit)
## Analysis of Variance Table
##
## Response: age
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 497.4 497.41 2.6459 0.1064
## Residuals 121 22747.4 187.99
What kinds of assumptions are being made in the linear model, e.g., about equality of variances? Try plotting fit
; what are the figures trying to tell you?
plot(fit)
fit
is an example of an R object. Find out it’s class
class(fit)
## [1] "lm"
plot
is an example of an R generic; it has different methods implemented for different classes of objects. Use methods()
to see available methods
methods(plot)
## [1] plot.HoltWinters* plot.TukeyHSD* plot.acf* plot.data.frame*
## [5] plot.decomposed.ts* plot.default plot.dendrogram* plot.density*
## [9] plot.ecdf plot.factor* plot.formula* plot.function
## [13] plot.hclust* plot.histogram* plot.isoreg* plot.lm*
## [17] plot.medpolish* plot.mlm* plot.ppr* plot.prcomp*
## [21] plot.princomp* plot.profile.nls* plot.raster* plot.spec*
## [25] plot.stepfun plot.stl* plot.table* plot.ts
## [29] plot.tskernel*
## see '?methods' for accessing help and source code
Look up the help page for the plot
generic, lm
method with
?plot.lm
Fitted models can be used in other functions, for instance to predict values for new data. Construct a data.frame
with a single column sex
with values "M"
and "F"
. Consult the help page for the predict.lm
method, and calculate the expected value of the fitted model for males and for females.
df = data.frame(sex=c("F", "M"))
predict(fit, df)
## 1 2
## 35.16667 30.92593
What do the predicted values correspond to in the t.test()
? Use coefficients()
to extract the coefficients of the fitted model.
coefficients(fit)
## (Intercept) sexM
## 35.166667 -4.240741
Interpret the (Intercept)
and sexM
coefficients in terms of female and male ages.
The article from which the pdata
object is derived states that “Although chromosome translocations and molecular rearrangements are relatively infrequent in T-lineage ALL, these events occur commonly in B-lineage ALL and reflect distinct mechanisms of transformation”. Let’s investigate this statement.
The relevant columns of data are summarized as
summary(pdata[,c("BT", "cyto.normal")])
## BT cyto.normal
## B2 :36 Mode :logical
## B3 :23 FALSE:69
## B1 :19 TRUE :24
## T2 :15 NA's :35
## B4 :12
## T3 :10
## (Other):13
Simplify the number of BT
levels by creating a new column with B
or T
. Do this by creating a substr()
ing from the original column, consisting of the first letter of each entry, and turning this vector into a factor()
. Assign it to a new column name
pdata$BorT <- factor(substr(pdata$BT, 1, 1))
Cross-tabulate the data
xtabs(~ BorT + cyto.normal, pdata)
## cyto.normal
## BorT FALSE TRUE
## B 56 17
## T 13 7
The data are qualitatively consistent with the statement that molecular rearrangements are more common in B-lineage ALL. Let’s test this with a chi-squared test
chisq.test(pdata$BorT, pdata$cyto.normal)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: pdata$BorT and pdata$cyto.normal
## X-squared = 0.59622, df = 1, p-value = 0.44
Interpret the results. What about additional parameters documented on ?chisq.test
?
The 128 samples we’ve been exploring are from a micro-array experiment. Microarrays consist of ‘probesets’ that interogate genes for their level of expression. In the experiment we’re looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing, the data we’ll work with has been pre-processed.
Start by finding the expression data file on disk.
path <- file.choose() # look for ALL-expression.csv
stopifnot(file.exists(path))
The data is stored in ‘comma-separate value’ format, with each probeset occupying a line, and the expression value for each sample in that probeset separated by a comma. Input the data using read.csv()
. There are three challenges:
row.names=1
to read.csv()
.check.colnames=FALSE
to over-ride R’s default.read.csv()
returns a data.frame
. We could use a data.frame
to work with our data, but really it is a matrix()
– the columns are of the same type and measure the same thing. Use as.matrix()
to coerce the data.frame
we input to a matrix
.exprs <- read.csv(path, row.names=1, check.names=FALSE)
exprs <- as.matrix(exprs)
We’ll also make use of the data describing the samples
path <- file.choose() # look for ALL-phenoData.csv
stopifnot(file.exists(path))
pdata <- read.csv(path, row.names=1)
We’ll add a column to pdata
, derived from the BT
column, to indicate whether the sample is B-cell or T-cell ALL.
pdata$BorT <- factor(substr(pdata$BT, 1, 1))
xtabs(~BorT + BT, pdata)
## BT
## BorT B B1 B2 B3 B4 T T1 T2 T3 T4
## B 5 19 36 23 12 0 0 0 0 0
## T 0 0 0 0 0 5 1 15 10 2
Some of the results below involve plots, and it’s convenient to choose pretty and functional colors. We use the RColorBrewer package; see colorbrewer.com
library(RColorBrewer)
divergent <- brewer.pal(11, "RdBu")
highlight <- brewer.pal(3, "Set2")[1:2]
‘divergent’ is a vector of colors that go from red (negative) to blue (positive). `highlight’ is a vector of length 2, light and dark green.
For more options see ?RColorBrewer
and to view the predefined palettes display.brewer.all()
Verify that we have a matrix of appropriate class()
and dim()
ensions; take a peak at the first five rows and columns of the data.
class(exprs)
## [1] "matrix"
dim(exprs)
## [1] 12625 128
exprs[1:5, 1:5]
## 01005 01010 03002 04006 04007
## 1000_at 7.597323 7.479445 7.567593 7.384684 7.905312
## 1001_at 5.046194 4.932537 4.799294 4.922627 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
## 1004_at 5.925260 5.912780 5.893209 6.170245 5.615210
We’ll work with a subset of the data, specifically the 1000 probesets that show the most variance across samples. The variance of a single row can be calculated as
var(exprs[1,])
## [1] 0.06768013
The variance across each row can be calculated using the apply()
function. The first argument is the rectangular data structure that we’d like to summarize, the second argument is the dimension (1
for rows; 2
for columns) over which we’d like to summarize, the third argument is the function that we’d like to apply to each row or column. Visualize the distribution of probeset variances using hist()
, perhaps on transformed data.
v <- apply(exprs, 1, var)
hist(sqrt(v))
Use order(v, decreasing=TRUE)
to determine how the elements of v
should be selected so that they are ordered from highest to lowest, visually verify that the values are actually ordered from highest to lowest…
o <- order(v, decreasing=TRUE)
plot(sqrt(v[o]))
Use head()
to select the indexes of the 1000 most variable probesets, and to create a subset of the exprs
matrix containing these probesets
exprs1000 <- exprs[head(o, 1000), ]
apply()
on exprs1000
to verify that we appear to have the 1000 most variable probesets.dist()
and hclust()
We’d like to understand whether there are patterns in the data, e.g., consistent differences in overall patterns of expression between samples. One approach is hierarchical clustering
Calculate the Euclidean distance between samples using dist()
on the t()
ranspose of exprs1000
(check out ?dist
for other distance metrics).
d <- dist(t(exprs1000))
Use hclust()
to cluster the data, and plot()
to visualize the clusters.
plot(hclust(d))
It’s informative to label the columns of exprs1000
with the ALL subtype (e.g., B1, B2, T1, etc). First, use identical()
to verify that the colnames()
of exprs1000
are the same as the rownames()
of pdata
.
identical(colnames(exprs1000), rownames(pdata))
## [1] TRUE
Now replace the column names of exprs1000
with pdata$BT
, and repeat the clustering.
colnames(exprs1000) <- pdata$BT
plot(hclust(dist(t(exprs1000))))
colnames(exprs1000) <- rownames(pdata)
It’s clear that overall gene expression patterns differ between B- and T-cell ALL.
The data supporting the dendrogram produced by hclust()
can be visualized as a heatmap()
. Here we calculate a correlation matrix between each sample; we will use this as the basis for our distance metric in ordering rows and columns of the heatmap.
d <- dist(t(exprs1000))
The following heatmap()
command performs hierarchical clustering of the distances, and adds a color bar to indicate whether the sample came from a B or T lineage sample.
color <- highlight[pdata$BorT]
heatmap(as.matrix(d), ColSideColor=color, col=divergent, symm=TRUE)
cmdscale()
and other approaches to dimension reductionWe have characterized our 128 samples in 1000 dimensions. Hierarchical clustering suggests that somehow there is pattern in this high-dimensional data. There are a number of statistical techniques that we can use to effectively reduce the dimensions of the data. The reduced-dimension representation might be used for visualization or downstream analysis.
One method for reducing the dimension of the data is classical multi-dimensional scaling (principle coordinates analysis). The idea is to take a measure of dissimilarity (like the distance matrix calculated above) and reduce it to a set of points (e.g., in 2 dimensions) such that the distance between points is approximately equal to the dissimilarity. This is very easy to implement in R, using the cmdscale()
function.
mds <- cmdscale(dist(t(exprs1000)))
plot(mds, col=color)
mds
is a 128 x 2 matrix, and the columns can be used as a reduced-dimension replacement for the 1000 expression values.
Principle components analysis is a similar approach to visualization and down-stream analysis in reduced dimensions; see prcomp()
.
knn()
Classification (‘supervised machine learning’) assigns ‘test’ samples into a group based on some measure of similarity to a set of ‘training’ samples. There are many varieties of classification. knn (‘k’ nearest neighbors) classification assigns the test sample to the majority-vote of it’s k nearest neighbors. knn()
and other common classifiers are defined in the class
package, so we start by adding that to our R session.
library(class)
We’ll start by creating a training set. We’ll choose the first half of the ‘B’ individuals and the first half of the ‘T’ individuals as members. First, we can get the index of each row using seq_along()
seq_along(pdata$BorT)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## [70] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [93] 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
## [116] 116 117 118 119 120 121 122 123 124 125 126 127 128
These values can be split into groups (a list-of-integers) using split()
idxByType <- split(seq_along(pdata$BorT), pdata$BorT)
idxByType
## $B
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## [65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
##
## $T
## [1] 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [25] 120 121 122 123 124 125 126 127 128
We can ask about the number of each element of idxByType
by extracting the element and using the length()
function, e.g., length(idxByType[["B"]])
. We can then use head()
to select the first half of the elements
Bidx <- idxByType[["B"]]
BTrainIdx <- head(Bidx, length(Bidx) / 2)
Likewise for T:
Tidx <- idxByType[["T"]]
TTrainIdx <- head(Tidx, length(Tidx) / 2)
There are two fun things to notice, and these help us expand our knowledge of R.
Notice that we are applying the same sequence of operations to the B and to the T indexes. R lets us write a function to capture this repeated operation
firstHalf <- function(x) {
head(x, length(x) / 2)
}
and we can verify that this works
firstHalf(idxByType[["B"]])
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
firstHalf(idxByType[["T"]])
## [1] 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
We’d really like to apply this to function to each element of the list idxByType
. This can be done using lapply()
, where the first argument is a vector and the second argument is the function we’d like to apply
trainIdx <- lapply(idxByType, firstHalf)
trainIdx
## $B
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
##
## $T
## [1] 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
OK, let’s compose our training set from the first half of each group using cbind()
to create a single matrix of training individuals
Bhalf <- exprs1000[, trainIdx[["B"]] ]
Thalf <- exprs1000[, trainIdx[["T"]] ]
train <- cbind(Bhalf, Thalf)
Our test set is the individuals not included in the training set. One way to select these individuals is to write a secondHalf()
function; another way is to subset exprs1000
to include the columns that are not in train
test <- exprs1000[, ! colnames(exprs1000) %in% colnames(train) ]
We’ll need to know the classification of the training and testing sets
cl_train <- pdata[colnames(train), "BorT"]
cl_test <- pdata[colnames(test), "BorT"]
knn()
is invoked with the training and test sets, and the known classification of the training set. The argument k
is the number of nearest neighbors to consider; the default is k=1
. The return value is the estimated classification of each test sample
cl_test_est <- knn(t(train), t(test), cl_train)
A characterization of how well the classifier performs is the ‘confusion matrix’, which summarizes the known versus estimated classification of the test individuals in a cross-tabulation
xtabs(~cl_test + cl_test_est)
## cl_test_est
## cl_test B T
## B 48 0
## T 0 17
Here, the classifier is performing very well – there are no incorrect (off-diagonal) assignments.
We’ve been using the known data to create a classifier and then to evaluate it’s performance. The decisions to use half of the B and half of the T samples, and to include the first half of each group in the ‘training’ group, were arbitrary, and it turns out that a better strategy is ‘leave-one-out’ cross-validation. This works by using all but one sample (e.g., the first sample) in the training group, and then classifying the sample that was left out. One then repeats for the second, third, … samples. R implements this for knn with the knn.cv()
function (again with a default k=1
nearest neighbors).
knn_loo <- knn.cv(t(exprs1000), pdata$BorT)
xtabs(~pdata$BorT + knn_loo)
## knn_loo
## pdata$BorT B T
## B 95 0
## T 0 33
As an exercise, use knn.cv()
to develop a classifier for T-cell subtypes; explore the consequences of different values of k
. Here’s a sketch of the approach:
Create a subset of the expression and phenotype data with only the T cell types; remember to use factor()
to remove unused levels
keep <- pdata$BorT == "T"
exprT <- exprs1000[, keep]
classT <- factor(pdata$BT[keep])
Use knn.cv()
to perform the cross-validation, and xtabs()
to summarize the confusion matrix. Interpret the confusion matrix.
xtabs(~classT + knn.cv(t(exprT), classT))
## knn.cv(t(exprT), classT)
## classT T T1 T2 T3 T4
## T 0 0 4 1 0
## T1 0 0 1 0 0
## T2 5 0 4 3 3
## T3 1 0 2 6 1
## T4 0 0 0 1 1
Experiment with different values of k
. Can you interpret the consequences of k
for the statistical principles of precision and accuracy?
There are very flexible ways of working with data, e.g., 1- and 2-dimensional subsetting with [
, accessing columns of a data.frame with $
or with two-dimensional subsetting, creating factor()
s, etc.
Univariate statistical functions, e.g., mean()
, median()
, t.test()
, lm()
, chisq.test()
.
Multivariate statistical functions, e.g., dist()
, hclust()
, heatmap()
, knn()
, knn.cv()
.
Packages provide specialized functionality, e.g., RColorBrewer, [class][].
Built-in functions can process rows or columns of a matrix (apply()
) or list elements (lapply()
; see also sapply()
, vapply()
, mapply()
, tappyl()
).
We can write our own functions!
The help system is our friend!