Today we’ll cover statistical concepts and tests commonly used in cancer research. The dataset we’ll access is a subset of the ALL expression data whose patient information we worked with in the first day’s material. In addition to that information we’ll access 1000 associated expression microarray features that present the highest variance across the patient samples. The data have been saved in a binary format to reduce file sizes.

1 Univariate Analysis

The following continues to explore the ‘ALL’ phenotypic data from this earlier in the course. Start by loading the data, as before.

path <- file.choose()    # look for ALLphenoData.tsv

stopifnot(file.exists(path))
pdata <- read.csv(path)

1.1 Descriptive statistics

Let’s look a little more closely at patient information in the pdata object:

median(pdata$age)

## [1] NA

The value NA shows up because some of the pdata$age values are NA. We can verify this by asking R if there are any NA values

any(is.na(pdata$age))

## [1] TRUE

anyNA(pdata$age)        # same, but more efficient

## [1] TRUE

Consulting the help page for ?median suggests a solution – specify the argument na.rm=TRUE. Explore other aspects of age, like range() and quantile().

median(pdata$age, na.rm=TRUE)

## [1] 29

range(pdata$age, na.rm=TRUE)

## [1]  5 58

quantile(pdata$age, na.rm=TRUE)

##   0%  25%  50%  75% 100% 
##  5.0 19.0 29.0 45.5 58.0

Some simple plots of patient ages – note the nested functions!

plot(pdata$age)
plot(sort(pdata$age))
sortedAge = sort(pdata$age)
?plot

Exercise: Plot the sortedAge with markers at each data point and connect the points with red lines. You’ll need to use the graphics parameters type="b" and col="red", see ?plot.default.
Exercise: Plot one variable (e.g., age) as a function of another (e.g., sex). Since sex is a factor, R chooses to create a box plot; does this make sense?
```
plot(age ~ sex, pdata)
```

Histograms, and their display options:

hist(pdata$age)
?hist
hist(pdata$age, br=25)

Cross tables use formulas to describe the relationship between the data they present:

xtabs(~sex, data=pdata, exclude=NA)

## sex
##  F  M 
## 42 83

xtabs(~sex + remission, data=pdata, exclude=NA)

##    remission
## sex CR REF
##   F 27   7
##   M 71   8

Exercise: How many hyperdiploid (kinet) males (sex) are refractory (remission)?

1.2 t-tests

Use plot() to visualize the distribution of female and male ages in the pdata data set.

plot(age ~ sex, pdata)

It looks like females are on average older than males. Use t.test() to find out.

t.test(age ~ sex, pdata)

## 
##  Welch Two Sample t-test
## 
## data:  age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.022660  9.504142
## sample estimates:
## mean in group F mean in group M 
##        35.16667        30.92593

Check out the help page for t.test()

?t.test

What are all those additional arguments to t.test()? For example, what is the meaning of the var.equal argument? Why are there 79.88 degrees of freedom?

t.test(age ~ sex, pdata, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  age by sex
## t = 1.6266, df = 121, p-value = 0.1064
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9207109  9.4021924
## sample estimates:
## mean in group F mean in group M 
##        35.16667        30.92593

1.3 Linear models

A t-test can also be viewed as an analysis of variance (ANOVA); analysis of variance is a form of linear model. Use lm() to fit a linear model that describes how age changes with sex; the anova() function summarizes the linear model in a perhaps more familiar ANOVA table.

(fit <- lm(age ~ sex, pdata))

## 
## Call:
## lm(formula = age ~ sex, data = pdata)
## 
## Coefficients:
## (Intercept)         sexM  
##      35.167       -4.241

anova(fit)

## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq F value Pr(>F)
## sex         1   497.4  497.41  2.6459 0.1064
## Residuals 121 22747.4  187.99

What kinds of assumptions are being made in the linear model, e.g., about equality of variances? Try plotting fit; what are the figures trying to tell you?

plot(fit)

fit is an example of an R object. Find out it’s class

class(fit)

## [1] "lm"

plot is an example of an R generic; it has different methods implemented for different classes of objects. Use methods() to see available methods

methods(plot)

##  [1] plot.HoltWinters*   plot.TukeyHSD*      plot.acf*           plot.data.frame*   
##  [5] plot.decomposed.ts* plot.default        plot.dendrogram*    plot.density*      
##  [9] plot.ecdf           plot.factor*        plot.formula*       plot.function      
## [13] plot.hclust*        plot.histogram*     plot.isoreg*        plot.lm*           
## [17] plot.medpolish*     plot.mlm*           plot.ppr*           plot.prcomp*       
## [21] plot.princomp*      plot.profile.nls*   plot.raster*        plot.spec*         
## [25] plot.stepfun        plot.stl*           plot.table*         plot.ts            
## [29] plot.tskernel*     
## see '?methods' for accessing help and source code

Look up the help page for the plot generic, lm method with

?plot.lm

Fitted models can be used in other functions, for instance to predict values for new data. Construct a data.frame with a single column sex with values "M" and "F". Consult the help page for the predict.lm method, and calculate the expected value of the fitted model for males and for females.

df = data.frame(sex=c("F", "M"))
predict(fit, df)

##        1        2 
## 35.16667 30.92593

What do the predicted values correspond to in the t.test()? Use coefficients() to extract the coefficients of the fitted model.

coefficients(fit)

## (Intercept)        sexM 
##   35.166667   -4.240741

Interpret the (Intercept) and sexM coefficients in terms of female and male ages.

1.4 Chi-squared tests

The article from which the pdata object is derived states that “Although chromosome translocations and molecular rearrangements are relatively infrequent in T-lineage ALL, these events occur commonly in B-lineage ALL and reflect distinct mechanisms of transformation”. Let’s investigate this statement.

The relevant columns of data are summarized as

summary(pdata[,c("BT", "cyto.normal")])

##        BT     cyto.normal    
##  B2     :36   Mode :logical  
##  B3     :23   FALSE:69       
##  B1     :19   TRUE :24       
##  T2     :15   NA's :35       
##  B4     :12                  
##  T3     :10                  
##  (Other):13

Simplify the number of BT levels by creating a new column with B or T. Do this by creating a substr()ing from the original column, consisting of the first letter of each entry, and turning this vector into a factor(). Assign it to a new column name

pdata$BorT <- factor(substr(pdata$BT, 1, 1))

Cross-tabulate the data

xtabs(~ BorT + cyto.normal, pdata)

##     cyto.normal
## BorT FALSE TRUE
##    B    56   17
##    T    13    7

The data are qualitatively consistent with the statement that molecular rearrangements are more common in B-lineage ALL. Let’s test this with a chi-squared test

chisq.test(pdata$BorT, pdata$cyto.normal)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  pdata$BorT and pdata$cyto.normal
## X-squared = 0.59622, df = 1, p-value = 0.44

Interpret the results. What about additional parameters documented on ?chisq.test?

2 Multivariate Analysis: Machine Learning

The 128 samples we’ve been exploring are from a micro-array experiment. Microarrays consist of ‘probesets’ that interogate genes for their level of expression. In the experiment we’re looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing, the data we’ll work with has been pre-processed.

Start by finding the expression data file on disk.

path <- file.choose()          # look for ALL-expression.csv
stopifnot(file.exists(path))

The data is stored in ‘comma-separate value’ format, with each probeset occupying a line, and the expression value for each sample in that probeset separated by a comma. Input the data using read.csv(). There are three challenges:

The row names are present in the first column of the data. Tell R this by adding the argument row.names=1 to read.csv().
By default, R checks that column names do not look like numbers, but our column names do look like numbers. Use the argument check.colnames=FALSE to over-ride R’s default.
read.csv() returns a data.frame. We could use a data.frame to work with our data, but really it is a matrix() – the columns are of the same type and measure the same thing. Use as.matrix() to coerce the data.frame we input to a matrix.

exprs <- read.csv(path, row.names=1, check.names=FALSE)
exprs <- as.matrix(exprs)

We’ll also make use of the data describing the samples

path <- file.choose()         # look for ALL-phenoData.csv
stopifnot(file.exists(path))
pdata <- read.csv(path, row.names=1)

We’ll add a column to pdata, derived from the BT column, to indicate whether the sample is B-cell or T-cell ALL.

pdata$BorT <- factor(substr(pdata$BT, 1, 1))
xtabs(~BorT + BT, pdata)

##     BT
## BorT  B B1 B2 B3 B4  T T1 T2 T3 T4
##    B  5 19 36 23 12  0  0  0  0  0
##    T  0  0  0  0  0  5  1 15 10  2

Some of the results below involve plots, and it’s convenient to choose pretty and functional colors. We use the RColorBrewer package; see colorbrewer.com

library(RColorBrewer)
divergent <- brewer.pal(11, "RdBu")
highlight <- brewer.pal(3, "Set2")[1:2]

‘divergent’ is a vector of colors that go from red (negative) to blue (positive). `highlight’ is a vector of length 2, light and dark green.

For more options see ?RColorBrewer and to view the predefined palettes display.brewer.all()

2.1 Preliminary exploration and cleanup

Verify that we have a matrix of appropriate class() and dim()ensions; take a peak at the first five rows and columns of the data.

class(exprs)

## [1] "matrix"

dim(exprs)

## [1] 12625   128

exprs[1:5, 1:5]

##              01005    01010    03002    04006    04007
## 1000_at   7.597323 7.479445 7.567593 7.384684 7.905312
## 1001_at   5.046194 4.932537 4.799294 4.922627 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
## 1004_at   5.925260 5.912780 5.893209 6.170245 5.615210

We’ll work with a subset of the data, specifically the 1000 probesets that show the most variance across samples. The variance of a single row can be calculated as

var(exprs[1,])

## [1] 0.06768013

The variance across each row can be calculated using the apply() function. The first argument is the rectangular data structure that we’d like to summarize, the second argument is the dimension (1 for rows; 2 for columns) over which we’d like to summarize, the third argument is the function that we’d like to apply to each row or column. Visualize the distribution of probeset variances using hist(), perhaps on transformed data.

v <- apply(exprs, 1, var)
hist(sqrt(v))

Use order(v, decreasing=TRUE) to determine how the elements of v should be selected so that they are ordered from highest to lowest, visually verify that the values are actually ordered from highest to lowest…

o <- order(v, decreasing=TRUE)
plot(sqrt(v[o]))

Use head() to select the indexes of the 1000 most variable probesets, and to create a subset of the exprs matrix containing these probesets

exprs1000 <- exprs[head(o, 1000), ]

Class exercise: Use apply() on exprs1000 to verify that we appear to have the 1000 most variable probesets.

2.2 Clustering: `dist()` and `hclust()`

We’d like to understand whether there are patterns in the data, e.g., consistent differences in overall patterns of expression between samples. One approach is hierarchical clustering

Calculate the Euclidean distance between samples using dist() on the t()ranspose of exprs1000 (check out ?dist for other distance metrics).

d <- dist(t(exprs1000))

Use hclust() to cluster the data, and plot() to visualize the clusters.

plot(hclust(d))

It’s informative to label the columns of exprs1000 with the ALL subtype (e.g., B1, B2, T1, etc). First, use identical() to verify that the colnames() of exprs1000 are the same as the rownames() of pdata.

identical(colnames(exprs1000), rownames(pdata))

## [1] TRUE

Now replace the column names of exprs1000 with pdata$BT, and repeat the clustering.

colnames(exprs1000) <- pdata$BT
plot(hclust(dist(t(exprs1000))))

colnames(exprs1000) <- rownames(pdata)

It’s clear that overall gene expression patterns differ between B- and T-cell ALL.

The data supporting the dendrogram produced by hclust() can be visualized as a heatmap(). Here we calculate a correlation matrix between each sample; we will use this as the basis for our distance metric in ordering rows and columns of the heatmap.

d <- dist(t(exprs1000))

The following heatmap() command performs hierarchical clustering of the distances, and adds a color bar to indicate whether the sample came from a B or T lineage sample.

color <- highlight[pdata$BorT]
heatmap(as.matrix(d), ColSideColor=color, col=divergent, symm=TRUE)

2.3 Clustering: `cmdscale()` and other approaches to dimension reduction

We have characterized our 128 samples in 1000 dimensions. Hierarchical clustering suggests that somehow there is pattern in this high-dimensional data. There are a number of statistical techniques that we can use to effectively reduce the dimensions of the data. The reduced-dimension representation might be used for visualization or downstream analysis.

One method for reducing the dimension of the data is classical multi-dimensional scaling (principle coordinates analysis). The idea is to take a measure of dissimilarity (like the distance matrix calculated above) and reduce it to a set of points (e.g., in 2 dimensions) such that the distance between points is approximately equal to the dissimilarity. This is very easy to implement in R, using the cmdscale() function.

mds <- cmdscale(dist(t(exprs1000)))
plot(mds, col=color)

mds is a 128 x 2 matrix, and the columns can be used as a reduced-dimension replacement for the 1000 expression values.

Principle components analysis is a similar approach to visualization and down-stream analysis in reduced dimensions; see prcomp().

2.4 Classification: `knn()`

Classification (‘supervised machine learning’) assigns ‘test’ samples into a group based on some measure of similarity to a set of ‘training’ samples. There are many varieties of classification. knn (‘k’ nearest neighbors) classification assigns the test sample to the majority-vote of it’s k nearest neighbors. knn() and other common classifiers are defined in the class package, so we start by adding that to our R session.

library(class)

We’ll start by creating a training set. We’ll choose the first half of the ‘B’ individuals and the first half of the ‘T’ individuals as members. First, we can get the index of each row using seq_along()

seq_along(pdata$BorT)

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
##  [24]  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46
##  [47]  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69
##  [70]  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92
##  [93]  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
## [116] 116 117 118 119 120 121 122 123 124 125 126 127 128

These values can be split into groups (a list-of-integers) using split()

idxByType <- split(seq_along(pdata$BorT), pdata$BorT)
idxByType

## $B
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## [65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
## 
## $T
##  [1]  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [25] 120 121 122 123 124 125 126 127 128

We can ask about the number of each element of idxByType by extracting the element and using the length() function, e.g., length(idxByType[["B"]]). We can then use head() to select the first half of the elements

Bidx <- idxByType[["B"]]
BTrainIdx <- head(Bidx, length(Bidx) / 2)

Likewise for T:

Tidx <- idxByType[["T"]]
TTrainIdx <- head(Tidx, length(Tidx) / 2)

There are two fun things to notice, and these help us expand our knowledge of R.

Notice that we are applying the same sequence of operations to the B and to the T indexes. R lets us write a function to capture this repeated operation

firstHalf <- function(x) {
    head(x, length(x) / 2)
}

and we can verify that this works

firstHalf(idxByType[["B"]])

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

firstHalf(idxByType[["T"]])

##  [1]  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111

We’d really like to apply this to function to each element of the list idxByType. This can be done using lapply(), where the first argument is a vector and the second argument is the function we’d like to apply

trainIdx <- lapply(idxByType, firstHalf)
trainIdx

## $B
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## 
## $T
##  [1]  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111

OK, let’s compose our training set from the first half of each group using cbind() to create a single matrix of training individuals

Bhalf <- exprs1000[, trainIdx[["B"]] ]
Thalf <- exprs1000[, trainIdx[["T"]] ]
train <- cbind(Bhalf, Thalf)

Our test set is the individuals not included in the training set. One way to select these individuals is to write a secondHalf() function; another way is to subset exprs1000 to include the columns that are not in train

test <- exprs1000[, ! colnames(exprs1000) %in% colnames(train) ]

We’ll need to know the classification of the training and testing sets

cl_train <- pdata[colnames(train), "BorT"]
cl_test <- pdata[colnames(test), "BorT"]

knn() is invoked with the training and test sets, and the known classification of the training set. The argument k is the number of nearest neighbors to consider; the default is k=1. The return value is the estimated classification of each test sample

cl_test_est  <- knn(t(train), t(test), cl_train)

A characterization of how well the classifier performs is the ‘confusion matrix’, which summarizes the known versus estimated classification of the test individuals in a cross-tabulation

xtabs(~cl_test + cl_test_est)

##        cl_test_est
## cl_test  B  T
##       B 48  0
##       T  0 17

Here, the classifier is performing very well – there are no incorrect (off-diagonal) assignments.

We’ve been using the known data to create a classifier and then to evaluate it’s performance. The decisions to use half of the B and half of the T samples, and to include the first half of each group in the ‘training’ group, were arbitrary, and it turns out that a better strategy is ‘leave-one-out’ cross-validation. This works by using all but one sample (e.g., the first sample) in the training group, and then classifying the sample that was left out. One then repeats for the second, third, … samples. R implements this for knn with the knn.cv() function (again with a default k=1 nearest neighbors).

knn_loo <- knn.cv(t(exprs1000), pdata$BorT)
xtabs(~pdata$BorT + knn_loo)

##           knn_loo
## pdata$BorT  B  T
##          B 95  0
##          T  0 33

As an exercise, use knn.cv() to develop a classifier for T-cell subtypes; explore the consequences of different values of k. Here’s a sketch of the approach:

Create a subset of the expression and phenotype data with only the T cell types; remember to use factor() to remove unused levels
```
keep <- pdata$BorT == "T"
exprT <- exprs1000[, keep]
classT <- factor(pdata$BT[keep])
```

Use knn.cv() to perform the cross-validation, and xtabs() to summarize the confusion matrix. Interpret the confusion matrix.

xtabs(~classT + knn.cv(t(exprT), classT))

##       knn.cv(t(exprT), classT)
## classT T T1 T2 T3 T4
##     T  0  0  4  1  0
##     T1 0  0  1  0  0
##     T2 5  0  4  3  3
##     T3 1  0  2  6  1
##     T4 0  0  0  1  1

Experiment with different values of k. Can you interpret the consequences of k for the statistical principles of precision and accuracy?

A.3 – Statistical Analysis

Martin Morgan martin.morgan@roswellpark.org

16 - 17 May, 2016

Contents

1 Univariate Analysis

1.1 Descriptive statistics

1.2 t-tests

1.3 Linear models

1.4 Chi-squared tests

2 Multivariate Analysis: Machine Learning

2.1 Preliminary exploration and cleanup

2.2 Clustering: `dist()` and `hclust()`

2.3 Clustering: `cmdscale()` and other approaches to dimension reduction

2.4 Classification: `knn()`

3 Lessons learned

A.3 – Statistical Analysis

Martin Morgan martin.morgan@roswellpark.org

16 - 17 May, 2016

Contents

1 Univariate Analysis

1.1 Descriptive statistics

1.2 t-tests

1.3 Linear models

1.4 Chi-squared tests

2 Multivariate Analysis: Machine Learning

2.1 Preliminary exploration and cleanup

2.2 Clustering: dist() and hclust()

2.3 Clustering: cmdscale() and other approaches to dimension reduction

2.4 Classification: knn()

3 Lessons learned

2.2 Clustering: `dist()` and `hclust()`

2.3 Clustering: `cmdscale()` and other approaches to dimension reduction

2.4 Classification: `knn()`