Abstract In this workshop, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The participants will gain a general overview of Bioconductor packages for mass spectrometry and proteomics, and learn how to navigate raw data and reconstruct quantitative data. The workshop is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.
This short tutorial is part of the ProteomicsBioc2016Workshop
package (version 0.2.1), available at https://github.com/lgatto/ProteomicsBioc2016Workshop.
This vignette available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
Modified: 2016-06-20 22:37:35
Compiled: Wed Jun 22 21:16:44 2016
To install this package and build the vignette:
If the BiocInstaller is not installted, start with
source("https://bioconductor.org/biocLite.R")
then
library("BiocInstaller")
biocLite("lgatto/ProteomicsBioc2016Workshop",
dependencies = TRUE, build_vignettes=TRUE)
To launch the vignette
library("ProteomicsBioc2016Workshop")
bioc2016()
In Bioconductor version 3.4, there are respectively 83 proteomics and 55 mass spectrometry software packages and 15 mass spectrometry experiment packages. These respective packages can be extracted with the proteomicsPackages()
, massSpectrometryPackages()
and massSpectrometryDataPackages()
and explored interactively.
library("RforProteomics")
pp <- proteomicsPackages("3.3")
display(pp)
AnnotationHub is a cloud resource set up and managed by the Bioconductor project that programmatically disseminates omics data. I am currently working on contributing proteomics data.
library("AnnotationHub")
ah <- AnnotationHub()
## snapshotDate(): 2016-06-06
query(ah, "proteomics")
## AnnotationHub with 4 records
## # snapshotDate(): 2016-06-06
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description,
## # preparerclass, tags, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH49006"]]'
##
## title
## AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
## AH49007 | PXD000001: Peptide-level quantitation data
## AH49008 | PXD000001: raw mass spectrometry data
## AH49009 | PXD000001: MS-GF+ identiciation data
Below, we download a raw mass spectrometry dataset with identifier AH49008
and store it in a variable names ms
.
ms <- ah[["AH49008"]]
## Warning: Failed to parse headers:
## 220-
## 220-
## 220-ftp.pride.ebi.ac.uk FTP server
## 220-
## 220-
## 220
## 331 Please specify the password.
## 230 Login successful.
## 257 "/"
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 213 20150116075122
## 229 Entering Extended Passive Mode (|||44630|)
## 200 Switching to Binary mode.
## 213 450032788
## 150 Opening BINARY mode data connection for TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML (450032788 bytes).
## 226 Transfer complete.
ms
## Mass Spectrometry file handle.
## Filename: 55314
## Number of scans: 7534
The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The filename, 55314, is not very descriptive because the data originates from the AnnotationHub
cloud repository. If the data was read from a local file, is would be named as the mzML
(or mzXML
) file.
Contemporary MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRIDE data base at the EBI for MS/MS experiments, PASSEL at the ISB for SRM data and the MassIVE resource. The rpx is an interface to ProteomeXchange and provides a basic and unified access to PX data.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD003861 2016-06-22 20:04:04 New
## 2 PXD003326 2016-06-22 13:50:09 New
## 3 PXD004081 2016-06-22 13:30:33 New
## 4 PXD003943 2016-06-22 10:40:18 New
## 5 PXD002823 2016-06-22 09:18:12 New
## 6 PXD004411 2016-06-21 18:46:36 Updated information
## 7 PXD004411 2016-06-21 17:41:43 New
## 8 PXD004410 2016-06-21 17:41:41 New
## 9 PXD004408 2016-06-21 17:41:39 New
## 10 PXD004406 2016-06-21 17:41:35 New
## 11 PXD004405 2016-06-21 17:41:33 New
## 12 PXD004404 2016-06-21 17:41:31 New
## 13 PXD004403 2016-06-21 17:41:28 New
## 14 PXD004402 2016-06-21 17:41:26 New
## 15 PXD004401 2016-06-21 17:41:23 New
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 10 files
## [1] 'F063721.dat' ... [10] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [10] "erwinia_carotovora.fasta"
Other metadata for the px
dataset:
pxtax(px)
pxurl(px)
pxref(px)
Data files can then be downloaded with the pxget
function as illustrated below.
mzf <- pxget(px, pxfiles(px)[6])
## Downloading 1 file
mzf
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
The mzR
package provides an interface to the proteowizard code base, the legacy RAMP is a non-sequential parser and other C/C++ code to access various raw data files, such as mzML
, mzXML
, netCDF
, and mzData
. The data is accessed on-disk, i.e it does not get loaded entirely in memory by default. The three main functions are openMSfile
to create a file handle to a raw data file, header
to extract metadata about the spectra contained in the file and peaks
to extract one or multiple spectra of interest. Other functions such as instrumentInfo
, or runInfo
can be used to gather general information about a run.
library("mzR")
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
hd <- header(ms)
dim(hd)
## [1] 7534 21
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum"
head(peaks(ms, 234))
## [,1] [,2]
## [1,] 100.3852 93.86617
## [2,] 101.6572 83.94572
## [3,] 107.3893 91.63007
## [4,] 112.0505 15697.57129
## [5,] 113.0538 106.47455
## [6,] 113.3050 83.92274
str(peaks(ms, 1:5))
## List of 5
## $ : num [1:25800, 1:2] 400 400 400 400 400 ...
## $ : num [1:25934, 1:2] 400 400 400 400 400 ...
## $ : num [1:26148, 1:2] 400 400 400 400 400 ...
## $ : num [1:26330, 1:2] 400 400 400 400 400 ...
## $ : num [1:26463, 1:2] 400 400 400 400 400 ...
Exercise Extract the index of the MS2 spectrum with the highest base peak intensity and plot its spectrum (using the
plot
method - we will see more about MS data visualisation in the next section). Is the data centroided or in profile mode?
The ProteomicsBioc2016Workshop
package distributes a small identification result file (see ?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid
) that we load and parse using infrastructure from the mzID package.
library("mzID")
(f <- dir(system.file("extdata", package = "ProteomicsBioc2016Workshop"),
pattern = "mzid", full.names = TRUE))
## [1] "/tmp/RtmpDXbAgP/Rinst1f032df26d37/ProteomicsBioc2016Workshop/extdata/TMT_Erwinia.mzid.gz"
id <- mzID(f)
## reading TMT_Erwinia.mzid.gz... DONE!
software(id)
## version name id
## 1 Beta (v10072) MS-GF+ ID_software
id
## An mzID object
##
## Software used: MS-GF+ (version: Beta (v10072))
##
## Rawfile: /home/lgatto/dev/00_github/RforProteomics/sandbox/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
##
## Database: /home/lgatto/dev/00_github/RforProteomics/sandbox/erwinia_carotovora.fasta
##
## Number of scans: 5287
## Number of PSM's: 5563
Various data can be extracted from the mzID
object, using one the accessor functions such as database
, sofware
, scans
, peptides
, … The object can also be converted into a data.frame
using the flatten
function.
head(flatten(id))
## spectrumid scan number(s) acquisitionnum passthreshold rank
## 1 scan=5782 5782 5782 TRUE 1
## 2 scan=6037 6037 6037 TRUE 1
## 3 scan=5235 5235 5235 TRUE 1
## 4 scan=5397 5397 5397 TRUE 1
## 5 scan=6075 6075 6075 TRUE 1
## 6 scan=5761 5761 5761 TRUE 1
## calculatedmasstocharge experimentalmasstocharge chargestate
## 1 1080.2321 1080.2325 3
## 2 1002.2115 1002.2089 3
## 3 1189.2800 1189.2836 3
## 4 960.5365 960.5365 3
## 5 1264.3419 1264.3409 3
## 6 1268.6501 1268.6429 2
## ms-gf:denovoscore ms-gf:evalue ms-gf:rawscore ms-gf:specevalue
## 1 174 5.430080e-21 147 3.764831e-27
## 2 245 9.943751e-20 214 6.902626e-26
## 3 264 2.564787e-19 211 1.778789e-25
## 4 178 2.581753e-18 154 1.792541e-24
## 5 252 2.178423e-17 188 1.510364e-23
## 6 138 2.329453e-17 123 1.618941e-23
## assumeddissociationmethod isotopeerror isdecoy post pre end start
## 1 HCD 0 FALSE S R 84 50
## 2 HCD 0 FALSE R K 315 288
## 3 HCD 0 FALSE A R 224 192
## 4 HCD 0 FALSE - R 290 264
## 5 HCD 0 FALSE F R 153 119
## 6 HCD 0 FALSE Y K 286 264
## accession length description
## 1 ECA1932 155 outer membrane lipoprotein
## 2 ECA1147 434 trigger factor
## 3 ECA0013 295 ribose-binding periplasmic protein
## 4 ECA1731 290 flagellin
## 5 ECA1443 298 UTP--glucose-1-phosphate uridylyltransferase
## 6 ECA1444 468 6-phosphogluconate dehydrogenase, decarboxylating
## pepseq modified modification
## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR FALSE <NA>
## 2 TQVLDGLINANDIEVPVALIDGEIDVLR FALSE <NA>
## 3 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR FALSE <NA>
## 4 SQILQQAGTSVLSQANQVPQTVLSLLR FALSE <NA>
## 5 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR FALSE <NA>
## 6 WTSQSSLDLGEPLSLITESVFAR FALSE <NA>
## idFile
## 1 TMT_Erwinia.mzid.gz
## 2 TMT_Erwinia.mzid.gz
## 3 TMT_Erwinia.mzid.gz
## 4 TMT_Erwinia.mzid.gz
## 5 TMT_Erwinia.mzid.gz
## 6 TMT_Erwinia.mzid.gz
## spectrumFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 4 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 5 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## 6 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## databaseFile
## 1 erwinia_carotovora.fasta
## 2 erwinia_carotovora.fasta
## 3 erwinia_carotovora.fasta
## 4 erwinia_carotovora.fasta
## 5 erwinia_carotovora.fasta
## 6 erwinia_carotovora.fasta
The mzR
interface provides a similar interface. It is however much faster as it does not read all the data into memory and only extracts relevant data on demand. It has also accessor functions such as softwareInfo
, mzidInfo
, … (use showMethods(classes = "mzRident", where = "package:mzR")
) to see all available methods.
library("mzR")
id2 <- openIDfile(f)
id2
## Identification file handle.
## Filename: TMT_Erwinia.mzid.gz
## Number of psms: 5655
softwareInfo(id2)
## [1] "MS-GF+ Beta (v10072) "
## [2] "ProteoWizard MzIdentML 3.0.6239 ProteoWizard"
The identification data can be accessed as a data.frame
with the psms
accessor.
head(psms(id2))
## spectrumID chargeState rank passThreshold experimentalMassToCharge
## 1 scan=5782 3 1 TRUE 1080.2325
## 2 scan=6037 3 1 TRUE 1002.2089
## 3 scan=5235 3 1 TRUE 1189.2836
## 4 scan=5397 3 1 TRUE 960.5365
## 5 scan=6075 3 1 TRUE 1264.3409
## 6 scan=5761 2 1 TRUE 1268.6429
## calculatedMassToCharge sequence modNum
## 1 1080.2321 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR 0
## 2 1002.2115 TQVLDGLINANDIEVPVALIDGEIDVLR 0
## 3 1189.2800 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR 0
## 4 960.5365 SQILQQAGTSVLSQANQVPQTVLSLLR 0
## 5 1264.3419 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR 0
## 6 1268.6501 WTSQSSLDLGEPLSLITESVFAR 0
## isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
## 1 FALSE S R 50 84 ECA1932 155
## 2 FALSE R K 288 315 ECA1147 434
## 3 FALSE A R 192 224 ECA0013 295
## 4 FALSE - R 264 290 ECA1731 290
## 5 FALSE F R 119 153 ECA1443 298
## 6 FALSE Y K 264 286 ECA1444 468
## DatabaseDescription acquisitionNum
## 1 ECA1932 outer membrane lipoprotein 5782
## 2 ECA1147 trigger factor 6037
## 3 ECA0013 ribose-binding periplasmic protein 5235
## 4 ECA1731 flagellin 5397
## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
## 6 ECA1444 6-phosphogluconate dehydrogenase, decarboxylating 5761
Exercise Is there a relation between the length of a protein and the number of identified peptides, conditioned by the (average) e-value of the identifications?
While searches are generally performed using third-party software independently of R or can be started from R using a system
call, the rTANDEM package allows one to execute such searches using the X!Tandem engine.
library("rTANDEM")
?rtandem
The MSGPplus package provides an interface to the very fast MSGF+
engine.
library("MSGFplus")
parameters <- msgfPar(database = 'milk-proteins.fasta',
tolerance = '20 ppm',
instrument = 'TOF')
runMSGF(parameters, 'msraw.mzML')
The above sections introduced low-level interfaces to raw and identification results. The MSnbase package provides abstractions for raw data through the MSnExp
class and containers for quantification data via the MSnSet
class. Both store the actual assay data (spectra or quantitation matrix) and sample and feature metadata, accessed with spectra
(or the [
, [[
operators) or exprs
, pData
and fData
.
The figure below give a schematics of an MSnSet
instance and the relation between the assay data and the respective feature and sample metadata.
Another useful slot is processingData
, accessed with processingData(.)
, that records all the processing that objects have undergone since their creation (see examples below).
The readMSData
will parse the raw data, extract the MS2 spectra and construct an MS experiment file.
library("MSnbase")
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$")
quantFile
## [1] "/home/ubuntu/R-libs/MSnbase/extdata/dummyiTRAQ.mzXML"
msexp <- readMSData(quantFile, verbose=FALSE)
msexp
## Object of class "MSnExp"
## Object size in memory: 0.2 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of MS1 acquisitions: 1
## Number of MSn scans: 5
## Number of precursor ions: 5
## 4 unique MZs
## Precursor MZ's: 437.8 - 716.34
## MSn M/Z range: 100 2016.66
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Wed Jun 22 21:21:37 2016
## MSnbase version: 1.21.7
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1.1 X2.1 ... X5.1 (5 total)
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
The identification results stemming from the same raw data file can then be used to add PSM matches.
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzid$")
identFile
## [1] "/home/ubuntu/R-libs/MSnbase/extdata/dummyiTRAQ.mzid"
msexp <- addIdentificationData(msexp, identFile)
## reading dummyiTRAQ.mzid... DONE!
fData(msexp)
## spectrum scan number(s) passthreshold rank calculatedmasstocharge
## X1.1 1 1 TRUE 1 645.0375
## X2.1 2 2 TRUE 1 546.9633
## X3.1 3 NA NA NA NA
## X4.1 4 NA NA NA NA
## X5.1 5 5 TRUE 1 437.2997
## experimentalmasstocharge chargestate ms-gf:denovoscore ms-gf:evalue
## X1.1 645.3741 3 77 79.36958
## X2.1 546.9586 3 39 13.46615
## X3.1 NA NA NA NA
## X4.1 NA NA NA NA
## X5.1 437.8040 2 5 366.38422
## ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod
## X1.1 -39 5.527468e-05 CID
## X2.1 -30 9.399048e-06 CID
## X3.1 NA NA <NA>
## X4.1 NA NA <NA>
## X5.1 -42 2.577830e-04 CID
## isotopeerror isdecoy post pre end start accession length
## X1.1 1 FALSE A R 186 170 ECA0984;ECA3829 231
## X2.1 0 FALSE A K 62 50 ECA1028 275
## X3.1 <NA> NA <NA> <NA> NA NA <NA> NA
## X4.1 <NA> NA <NA> <NA> NA NA <NA> NA
## X5.1 1 FALSE L K 28 22 ECA0510 166
## description
## X1.1 DNA mismatch repair protein;acetolactate synthase isozyme III large subunit
## X2.1 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
## X3.1 <NA>
## X4.1 <NA>
## X5.1 putative capsular polysacharide biosynthesis transferase
## pepseq modified modification idFile
## X1.1 VESITARHGEVLQLRPK FALSE NA dummyiTRAQ.mzid
## X2.1 IDGQWVTHQWLKK FALSE NA dummyiTRAQ.mzid
## X3.1 <NA> NA NA <NA>
## X4.1 <NA> NA NA <NA>
## X5.1 LVILLFR FALSE NA dummyiTRAQ.mzid
## databaseFile nprot npep.prot npsm.prot npsm.pep
## X1.1 erwinia_carotovora.fasta 2 1 1 1
## X2.1 erwinia_carotovora.fasta 1 1 1 1
## X3.1 <NA> NA NA NA NA
## X4.1 <NA> NA NA NA NA
## X5.1 erwinia_carotovora.fasta 1 1 1 1
msexp[[1]]
## Object of class "Spectrum2"
## Precursor: 645.3741
## Retention time: 25:1
## Charge: 3
## MSn level: 2
## Peaks count: 2921
## Total ion count: 668170086
as(msexp[[1]], "data.frame")[100:105, ]
## mz i
## 100 141.0990 588594.812
## 101 141.1015 845401.250
## 102 141.1041 791352.125
## 103 141.1066 477623.000
## 104 141.1091 155376.312
## 105 141.1117 4752.541
MSnExp
object load all MS data into memory. This is a viable solution for MS2 data, but does not scale to MS1 data, especially when multiple files are loaded together. With the help of Johannes Rainer, a new MSnExp
class supporting on disk access is being developed.
chromatogram(ms)
It is of course possible to overlay several chromatograms. The code chunks below are not executed dynamically so save time with downloading raw data files.
## manual download
library("RforProteomics")
url <- "http://proteome.sysbiol.cam.ac.uk/lgatto/files/Thermo-HELA-PRT/"
f1 <- downloadData(file.path(url, "Thermo_Hela_PRTC_1.mzML"))
f2 <- downloadData(file.path(url, "Thermo_Hela_PRTC_2.mzML"))
f3 <- downloadData(file.path(url, "Thermo_Hela_PRTC_3.mzML"))
## plotting
c1 <- chromatogram(f1)
c2 <- chromatogram(f2, plot = FALSE)
lines(c2, col = "steelblue", lty = "dashed", lwd = 2)
c3 <- chromatogram(f3, plot = FALSE)
lines(c2, col = "orange", lty = "dotted")
par(mfrow = c(1, 2))
xic(ms, mz = 636.925, width = 0.01)
x <- xic(ms, mz = 636.925, width = 0.01, rtlim = c(2120, 2200))
We first load a test iTRAQ data called itraqdata
and distributed as part of the MSnbase package using the data
function. This is a pre-packaged data that comes as a dedicated data structure called MSnExp
. We then plot
the 10th spectrum using specific code that recognises what to do with an element of an MSnExp
.
data(itraqdata)
itraqdata
## Object of class "MSnExp"
## Object size in memory: 1.88 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of MS1 acquisitions: 1
## Number of MSn scans: 55
## Number of precursor ions: 55
## 55 unique MZs
## Precursor MZ's: 401.74 - 1236.1
## MSn M/Z range: 100 2069.27
## MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
plot(itraqdata[[10]], reporters = iTRAQ4, full=TRUE)
The ms
data is not pre-packaged as an MSnExp
data. It is a more bare-bone mzRramp object, a pointer to a raw data file (here TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML): we need first to extract a spectrum of interest (here the 3071st spectrum, an MS1 spectrum), and use the generic plot
function to visualise the spectrum.
plot(peaks(ms, 3071), type = "h",
xlab = "M/Z", ylab = "Intensity",
sub = formatRt(hd[3071, "retentionTime"]))
The importance of flexible access to specialised data becomes visible in the figure below (taken from the RforProteomics visualisation vignette). Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.
In this code chunks we start by selecting relevant spectra of interest. We will focus on the first MS1 spectrum acquired after 30 minutes of retention time.
## (1) Open raw data file
ms <- openMSfile(mzf)
## (2) Extract the header information
hd <- header(ms)
## (3) MS1 spectra indices
ms1 <- which(hd$msLevel == 1)
## (4) Select MS1 spectra with retention time between 30 and 35 minutes
rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35
## (5) Indices of the 1st and 2nd MS1 spectra after 30 minutes
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
## (6) Interleaved MS2 spectra
ms2 <- (i+1):(j-1)
Now now extract and plot all relevant information:
chromatogram
.chromatogram(ms)
chromatogram(ms)
abline(v = hd[i, "retentionTime"], col = "red")
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5))
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
length(ms2)
MS2 spectra highlighted above.par(mfrow = c(5, 2), mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Putting it all together:
layout(lout)
par(mar=c(4,2,1,1))
chromatogram(ms)
abline(v = hd[i, "retentionTime"], col = "red")
par(mar = c(3, 2, 1, 0))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
par(mar = c(3, 0.5, 1, 1))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5),
yaxt = "n")
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
par(mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Below, we illustrate some additional visualisation and animations of raw MS data, also taken from the RforProteomics visualisation vignette. On the left, we have a heatmap visualisation of a MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.
## (1) MS space heaptmap
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
## 1
ff <- colorRampPalette(c("yellow", "steelblue"))
trellis.par.set(regions=list(col=ff(100)))
m1 <- plot(M, aspect = 1, allTicks = FALSE)
## (2) Same data as (1), in 3 dimenstion
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)
## (3) The 2 MS1 and 10 interleaved MS2 spectra from above
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
## 1
m3 <- plot3D(M2)
grid.arrange(m1, m2, m3, ncol = 3)
Below, we have animations build from extracting successive slices as above.