Chapter 9 Demo

9.1 Project Setup

I suggest building your data analysis projects in RStudio (Click File - New Project - New Directory - Empty Project). Then assign a name to your project. I also recommend the following practices if you are familiar with them.

  • Use git/GitHub to version control your code and sync your project online.

  • Do NOT use your own name for the project, because other people might collaborate with you and someone might check your data when you publish your papers. Each project should correspond to one paper or one chapter of your thesis.

  • Keep a workflow document (txt or doc) in your project to record all of the steps and code you performed for this project. Treat this document as the digital version of your lab notebook.

  • Use a data folder in your project folder for the raw data and the results you get in the data analysis.

  • Use a figure folder in your project folder for the figures.

  • Use a manuscript folder in your project folder for the manuscript (you can write the paper in RStudio with the help of R Markdown templates).

  • Just double-click [yourprojectname].Rproj to start your project.
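The folder layout suggested above can be created in one go from the R console; the project name here is just a placeholder:

```r
# create the suggested project skeleton: data, figure and manuscript
# folders plus a workflow log (showWarnings = FALSE skips existing dirs)
proj <- "myproject"   # placeholder project name
dirs <- file.path(proj, c("data", "figure", "manuscript"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(proj, "workflow.txt"))
```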

9.2 Data input

xcms does not support the raw files of every mass spectrometry manufacturer. You need to convert your raw data into an open-source data format such as mzData, mzXML or CDF. The tool for this is MSConvert from ProteoWizard.
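After conversion, the open-format files can be collected in R by extension before they go into xcms; the path below is only a placeholder:

```r
# gather converted open-format files under a folder (placeholder path)
path <- "D:/metademo/data/"
files <- list.files(path,
                    pattern = "\\.(mzXML|mzData|mzML|CDF)$",
                    full.names = TRUE, recursive = TRUE,
                    ignore.case = TRUE)
```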

9.3 Optimization

The IPO package can be used to optimize the parameters for XCMS. Try the following code.

library(IPO)
peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
path <- list.files('D:/metademo/data/oq/',full.names = T,recursive = T)
# change ppm to 5 for Orbitrap data
peakpickingParameters$ppm <- 15
resultPeakpicking <- 
  optimizeXcmsSet(files = path[c(1,2,3)], 
                  params = peakpickingParameters,
                  nSlaves = 1,
                  subdir = NULL)

optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset
retcorGroupParameters <- getDefaultRetGroupStartingParams()
resultRetcorGroup <-
  optimizeRetGroup(xset = optimizedXcmsSetObject, 
                   params = retcorGroupParameters, 
                   subdir = NULL)
para <- capture.output(writeRScript(resultPeakpicking$best_settings$parameters, resultRetcorGroup$best_settings), type = "message")
save(para,file = 'para.RData')
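writeRScript prints the optimized settings as lines such as `ppm = 15`, which is why the captured output can later be parsed with regular expressions, as the wrap function in the next section does. A minimal base-R sketch of that parsing; the sample lines here are made up:

```r
# hypothetical captured lines in the style of IPO's writeRScript output
para <- c("ppm = 15", "peakwidth = c(12.2, 35.7)")
# pull all numbers from the line that mentions a given parameter
getnum <- function(name, para) {
  line <- para[grepl(name, para)]
  as.numeric(unlist(regmatches(line, gregexpr("\\d+\\.?\\d*", line))))
}
getnum("ppm", para)        # 15
getnum("peakwidth", para)  # 12.2 35.7
```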

9.4 Wrap function

Here we will use the optimized parameters for peak picking, retention time correction and peak filling.

library(xcms)
library(stringr)
getrtmz <- function(path, index = NULL){
peakwidth <- as.numeric(unlist(str_extract_all(para[grepl('peakwidth',para)],'\\d+\\.*\\d*')))
ppm <- as.numeric(unlist(str_extract_all(para[grepl('ppm',para)],'\\d+')))
noise <- as.numeric(unlist(str_extract_all(para[grepl('noise',para)],'\\d+')))
snthresh <- as.numeric(unlist(str_extract_all(para[grepl('snthresh',para)],'\\d+')))
mzdiff <- as.numeric(unlist(str_extract_all(para[grepl('mzdiff',para)],'\\d+\\.*\\d*')))
prefilter <- as.numeric(unlist(str_extract_all(para[grepl('prefilter',para)],'\\d+\\.*\\d*')))
integrate <- as.numeric(unlist(str_extract_all(para[grepl('integrate',para)],'\\d+')))
profStep <- round(as.numeric(unlist(str_extract_all(para[grepl('profStep',para)],'\\d+\\.*\\d*'))),1)
center <- as.numeric(unlist(str_extract_all(para[grepl('center',para)],'\\d+')))
response <- as.numeric(unlist(str_extract_all(para[grepl('response',para)],'\\d+')))
gapInit <- as.numeric(unlist(str_extract_all(para[grepl('gapInit',para)],'\\d+\\.*\\d*')))
gapExtend <- as.numeric(unlist(str_extract_all(para[grepl('gapExtend',para)],'\\d+\\.*\\d*')))
factorDiag <- as.numeric(unlist(str_extract_all(para[grepl('factorDiag',para)],'\\d+')))
factorGap <- as.numeric(unlist(str_extract_all(para[grepl('factorGap',para)],'\\d+')))
localAlignment <- as.numeric(unlist(str_extract_all(para[grepl('localAlignment',para)],'\\d+')))
bw <- as.numeric(unlist(str_extract_all(para[grepl('bw',para)],'\\d+\\.*\\d*')))
mzwid <- as.numeric(unlist(str_extract_all(para[grepl('mzwid',para)],'\\d+\\.*\\d*')))
minfrac <- as.numeric(unlist(str_extract_all(para[grepl('minfrac',para)],'\\d+\\.*\\d*')))
minsamp <- as.numeric(unlist(str_extract_all(para[grepl('minsamp',para)],'\\d+')))
max <-  as.numeric(unlist(str_extract_all(para[grepl('max',para)],'\\d+')))
  files <- list.files(path, full.names = TRUE, recursive = TRUE)
  if (!is.null(index)) files <- files[index]
  xset <- xcmsSet(files,
  method = "centWave",
  peakwidth       = peakwidth,
  ppm             = ppm,
  noise           = noise,
  snthresh        = snthresh,
  mzdiff          = mzdiff,
  prefilter       = prefilter,
  mzCenterFun     = "wMean",
  integrate       = integrate,
  fitgauss        = FALSE,
  verbose.columns = FALSE)
xset <- retcor(xset,
  method         = "obiwarp",
  plottype       = "none",
  distFunc       = "cor_opt",
  profStep       = profStep,
  center         = center,
  response       = response,
  gapInit        = gapInit,
  gapExtend      = gapExtend,
  factorDiag     = factorDiag,
  factorGap      = factorGap,
  localAlignment = localAlignment)
xset <- group(xset,
  method  = "density",
  bw      = bw,
  mzwid   = mzwid,
  minfrac = minfrac,
  minsamp = minsamp,
  max     = max)

xset <- fillPeaks(xset)
return(xset)
}

9.4.1 Peak picking

The first step in processing MS data is to find the peaks against the noise. In xcms, all of the related steps are handled by the xcmsSet function.

For any function in xcms or R, you can get its documentation by typing ? before the function name. Another quick way is to type the function name in the RStudio console and press F1 for help.


In the documentation of xcmsSet, we can set the sample classes, profmethod, profparam, polarity, etc. In the online version, such configurations are shown in certain windows; in a local analysis environment, you set these parameters yourself. However, the default configuration should satisfy most analyses, because the related information is usually recorded in your raw data and xcms can find it. All you need to do is point xcmsSet at the data directory.

If your data have multiple groups, such as a control and a treated group, just put them in separate subfolders of the data folder and xcmsSet will read them as separate groups.
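xcmsSet takes the class label of each file from the name of its parent folder. A base-R sketch of how the labels fall out of a layout like `data/control/` and `data/treated/` (hypothetical file names):

```r
# hypothetical layout: each subfolder name becomes a sample class
files <- c("data/control/c1.mzXML", "data/control/c2.mzXML",
           "data/treated/t1.mzXML")
# the parent folder name is what xcmsSet uses as the class label
basename(dirname(files))  # "control" "control" "treated"
```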

9.4.2 Data correction

Data correction may be needed for many reasons, such as instrument instability or contamination on the column. In xcms, the most important correction is retention time correction. Remember that the original retention times will be changed, so save the result in a new object.

9.4.3 Peaks filling

After the retention time correction, the missing peaks can be filled by fillPeaks. Peak filling avoids having too many NAs, which would distort the statistical analysis. The algorithm uses the baseline signal to impute the missing data.
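The actual fillPeaks algorithm reads the raw signal back from each file; as a toy illustration of the idea of baseline imputation (not the real algorithm), missing values can be replaced with a small baseline-level intensity:

```r
# toy peak table: rows are peaks, columns are samples, NA = undetected
x <- matrix(c(100, NA, 120,
              NA,  80,  90), nrow = 2, byrow = TRUE)
# replace each missing value with a rough baseline estimate,
# here half the smallest observed intensity in that peak's row
impute <- function(row) {
  row[is.na(row)] <- min(row, na.rm = TRUE) / 2
  row
}
t(apply(x, 1, impute))
```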

9.5 Peaks list

Then we can extract the peak list from the xcmsSet object.

# get the xcmsset object
neg <- getrtmz('D:/metademo/data/')
# back up the xcmsset object
save(neg,file = 'neg.Rdata')
# get the number
npeaks <- nrow(neg@groups)
# get the EIC, boxplot and diffreport; eicmax should equal the number of peak groups in the object
report <- CAMERA::annotateDiffreport(neg,filebase = 'peaklistneg',metlin = T, eicmax = npeaks, classeic = neg@phenoData$class)
# save the report as a csv file
write.csv(report,file = 'all.csv')

9.6 Peaks filtering

After we get the peak list, it is necessary to filter it to retain only the high-quality peaks for further analysis. Normally, we can use the relative standard deviation (RSD) of the blank and pooled QC samples to control the peak list.
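The RSD criterion itself is simply sd/mean expressed in percent. A base-R sketch with toy numbers, using the same 30% cutoff as the code below:

```r
# toy intensities: three peaks across four pooled QC injections
qc <- matrix(c(100, 105,  95, 100,
                50, 120,  10,  60,
               200, 210, 190, 205), nrow = 3, byrow = TRUE)
# relative standard deviation in percent, per peak
rsd <- apply(qc, 1, function(x) sd(x) / mean(x) * 100)
keep <- rsd < 30   # keep peaks stable across the QC injections
```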

# get the peak intensity, m/z, retention time and group information as list
mzrt <- enviGCMS::getmzrt(neg)
# get the mean and rsd for each group
mzrtm <- enviGCMS::getdoe(mzrt)
gm <- mzrtm$groupmean
gr <- mzrtm$grouprsd
# find the blank group and pool QC group
blk <- grepl('k',colnames(gm))
pqc <- grepl('q',colnames(gm))
# filter by pooled QC and blank group mean intensity (pooled QC should be larger than three times the blank), return numbers and index
sum(indexmean <- apply(gm,1,function(x) all(x[pqc]>= 3*x[blk])))
# filter by pooled QC rsd%, return numbers and index
rsdcf <- 30
sum(indexrsd <- apply(gr,1,function(x) ifelse(is.na(x[pqc]), TRUE, x[pqc] < rsdcf)))
# overlap with rsd% and mean filter
sum(index <- indexmean&indexrsd)

# new list, update group and remove pool qc/blk
qcindex <- grepl('k',mzrt$group$class) | grepl('q',mzrt$group$class)
mzrtfilter <- list(data = mzrt$data[index,!qcindex],
                   mz = mzrt$mz[index],
                   rt = mzrt$rt[index],
                   group = droplevels(mzrt$group$class[!qcindex]))
# get the filtered csv
enviGCMS::getupload(mzrtfilter,name = 'peakfilter')

9.7 Normalization (Optional)

Normalization is necessary if your data were collected in different batches or run with different instrument parameters.
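A quick way to see a batch shift before any modeling is to compare per-sample medians. The toy base-R sketch below applies simple median centering; this is a crude stand-in for illustration, not the sva-based correction used in the code that follows:

```r
# toy data: 4 peaks x 4 samples; samples 3-4 carry a simulated batch shift
set.seed(1)
x <- matrix(rnorm(16, mean = 10), nrow = 4)
x[, 3:4] <- x[, 3:4] + 5
# subtract each sample's median intensity
xc <- sweep(x, 2, apply(x, 2, median))
apply(xc, 2, median)  # all zero after centering
```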

# visualize the batch effect
mzrtsim::rlaplot(mzrt$data,lv = mzrt$group$class)
mzrtsim::ridgesplot(mzrt$data,lv = mzrt$group$class)
# get the simulation data and test on NOREVA
sim <- mzrtsim::simmzrt(mzrt$data)
# correct the batch effect by sva
mzrtcor <- mzrtsim::svacor(mzrt$data,lv = mzrt$group$class)
# visualize the batch effect correction
li <- mzrtsim::limmaplot(mzrtcor,lv = mzrt$group$class)
# return the corrected data
mzrt$data <- mzrtcor$dataCorrected

9.8 Statistical analysis

Here we can use the caret package to perform statistical analysis.

library(caret)
## Build a data frame with samples in rows and add the group labels
data <- cbind.data.frame(t(mzrtfilter$data), group = mzrtfilter$group)
## Split the data on the group labels
trainIndex <- createDataPartition(data$group, p = .8, 
                                  list = FALSE, 
                                  times = 1)
## Get the training and testing datasets
Train <- data[ trainIndex,]
Test  <- data[-trainIndex,]
## Set the cross validation method
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
# extra papameters for GBM 
gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9), 
                        n.trees = (1:30)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)

gbmFit <- train(group ~ ., data = Train, 
                 method = "gbm", 
                 trControl = fitControl, 
                 verbose = FALSE, 
                 ## Now specify the exact models 
                 ## to evaluate:
                 tuneGrid = gbmGrid)
# show the fitting process
# ANOVA analysis for model selection
# find the important variables
Imp <- varImp(gbmFit)

9.9 Annotation

Here I use the xMSannotator package to make annotations with HMDB as the reference database.

# number of worker nodes for xMSannotator
num_nodes = 10
negsub <- getrtmz('D:/metademo/data/oq/')
anno <- xsetplus::fanno(negsub,outloc = 'D:/metademo/anno',mode = 'neg')

9.10 Pathway Analysis

We can output the files for online pathway analysis with the mummichog algorithm.

# get the file
xsetplus::mumdata(neg,lv = mzrt$group$class)
# mummichog1 -f 'test.txt' -o myResult

9.11 MetaboAnalyst

Actually, after you perform data correction, you already have the data matrix for statistical analysis. You might choose MetaboAnalyst, online or offline, for further analysis; it supplies more statistical options than xcms.

The input data format for MetaboAnalyst should have rows for peaks and columns for samples. You can also add group information if available. Use the following code to get the data for analysis.

# get the csv file for MetaboAnalyst
enviGCMS::getupload(neg,name = 'metabo')
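If you ever need to assemble the csv by hand from a plain matrix instead, the layout is simple; a toy base-R sketch with made-up names (check the exact format expected by the MetaboAnalyst module you use):

```r
# toy peak table: rows = peaks, columns = samples
dat <- matrix(1:6, nrow = 2,
              dimnames = list(c("M100T60", "M200T120"),
                              c("s1", "s2", "s3")))
# add a first row carrying the group label of each sample
out <- rbind(group = c("control", "control", "treated"), dat)
write.csv(out, file = "metabo_manual.csv")
```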

9.12 Summary

This is the offline metabolomics data processing workflow. For each study the details will differ, and F1 is always your best friend.

Enjoy yourself in data mining!