Chapter 5 Workflow
For a book-length treatment of metabolomics data analysis, you could check (S. Li 2020).
5.1 Platform for metabolomics data analysis
Here is a list of related open source projects.
5.1.1 XCMS & XCMS online
XCMS Online is hosted by the Scripps Research Institute. If your datasets are not large, XCMS Online would be the best option for you. It was recently updated to support more functions for systems biology. It uses METLIN and isoMETLIN to annotate MS/MS data, and pathway analysis is also supported. To speed up processing, XCMS Online also offers XCMS Stream (Windows only), which connects your instrument workstation to their server so that data are processed automatically as they are acquired. They have also developed apps for XCMS Online, but I think a Slack app to control the data processing would be even cooler.
xcms is different from XCMS Online, although they might share the same code. I use it almost every day to run metabolomics data analysis locally. xcms 3 is a major update of the object classes: the data format is integrated with the MSnbase package, and the parameters are easier to set up for each step. Normally, I use msconvert-IPO-xcms-xMSannotator-metaboanalyst as the workflow to process data offline. The processing can be accelerated by running it in parallel. However, if you are not familiar with R, you had better choose one of the software packages below.
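The first step of the workflow above, converting vendor raw files to an open format with msconvert, can be sketched as follows. This is a minimal sketch and not the author's script: the function names, file names and output folder are hypothetical, and the peak picking filter is only one common choice. The code only builds the command lines; it does not run msconvert.

```python
def msconvert_command(raw_file, out_dir="data/mzML"):
    """Build (but do not run) an msconvert call that writes centroided mzML.

    Assumption: ProteoWizard's msconvert is on the PATH when the command
    is eventually executed. The folder name is a placeholder.
    """
    return [
        "msconvert", str(raw_file),
        "--mzML",                           # write mzML output
        "--filter", "peakPicking true 1-",  # centroid all MS levels
        "-o", str(out_dir),                 # output directory
    ]


def plan_conversion(raw_files, out_dir="data/mzML"):
    """Return one msconvert command per raw file."""
    return [msconvert_command(f, out_dir) for f in raw_files]


# Hypothetical file names, printed for inspection before running anything.
for cmd in plan_conversion(["data/raw/sample01.raw", "data/raw/sample02.raw"]):
    print(" ".join(cmd))
```

Each command list could then be passed to `subprocess.run` on a machine where msconvert is installed. The resulting mzML files are what IPO and xcms would read in the next steps.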
IPO is a tool for automated optimization of XCMS parameters (Libiseller et al. 2015), and Warpgroup is used for chromatogram subregion detection, consensus integration bound determination and accurate missing value integration (Nathaniel G. Mahieu, Genenbacher, and Patti 2016a). Another option is AutoTuner, which is much faster than IPO (McLean and Kujawinski 2020). More recently, MetaboAnalystR 3.0 can also optimize the parameters for xcms, although you then need to perform the downstream analysis within that software (Pang et al. 2020).
Check these papers for the XCMS based workflow (Forsberg et al. 2018; Huan et al. 2017; Nathaniel G. Mahieu, Genenbacher, and Patti 2016b; Montenegro-Burke et al. 2017; Domingo-Almenara and Siuzdak 2020). For METLIN related annotation, check these papers (Guijas et al. 2018; Tautenhahn et al. 2012; Xue, Guijas, et al. 2020; Domingo-Almenara, Montenegro-Burke, Ivanisevic, et al. 2018).
MAIT is based on xcms, and its source code is available (Fernández-Albert et al. 2014).
iMet-Q is an automated tool with a friendly user interface for quantifying metabolites in full-scan liquid chromatography-mass spectrometry (LC-MS) data (Chang et al. 2016).
compMS2Miner is an automatable metabolite identification, visualization, and data-sharing R package for high-resolution LC–MS data sets. Here are the related papers (Edmands et al. 2017; Edmands, Hayes, and Rappaport 2018; Edmands, Barupal, and Scalbert 2015).
mzMatch is a modular, open source and platform independent data processing pipeline for metabolomics LC/MS data written in Java, which can be coupled with xcms (Scheltema et al. 2011a; Creek et al. 2012). It can also be used for annotation with MetAssign (Daly et al. 2014a).
5.1.2 PRIMe
PRIMe is from RIKEN and UC Davis. They update their database frequently (Tsugawa et al. 2016). It supports mzML and the major MS vendor formats. They defined their own file format, ABF, and an ecosystem for omics studies. The software is updated almost every day. You could use MS-DIAL for untargeted analysis and MRMPROBS for targeted analysis. For annotation, they developed MS-FINDER, and they also provide statistical tools for Excel. This platform could replace expensive commercial software and is well prepared for MS/MS data analysis and lipidomics. The tools are open source, work on Windows and can also run within Mathematica. However, they don't cover pathway analysis. Another feature is that they always show the most recent spectral records from public repositories, so you can always get updated MSP spectra files for your own data analysis.
For PRIMe based workflow, check these papers (Lai et al. 2018a; Matsuo et al. 2017; Treutler et al. 2016a; Tsugawa et al. 2015a; Tsugawa et al. 2016; Kind et al. 2018).
5.1.3 GNPS
GNPS is an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass spectrometry (MS/MS) data. It offers a straightforward annotation method for MS/MS data. Feature-based molecular networking (FBMN) within GNPS can be coupled with xcms, OpenMS, MS-DIAL, MZmine 2, and other popular software.
Check these papers for GNPS and related projects (Aron et al. 2020; Nothias et al. 2020; Scheubert et al. 2017; R. R. da Silva et al. 2018; M. Wang et al. 2016).
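Molecular networking connects MS/MS spectra that look alike, so its core operation is a spectral similarity score. The sketch below shows a plain cosine score between two centroided spectra. It is only an illustration of the idea under simple assumptions: GNPS itself uses a more elaborate, modified cosine scoring, and the function name, the greedy peak matching and the 0.02 m/z tolerance are choices made for this example.

```python
import math


def cosine_score(spec_a, spec_b, tol=0.02):
    """Plain cosine similarity between two centroided MS/MS spectra.

    Each spectrum is a list of (mz, intensity) pairs. Peaks are paired
    greedily, largest intensity product first, when their m/z values
    differ by at most `tol`. Each peak is used at most once.
    """
    # Collect every pair of peaks that lies within the m/z tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tol:
                candidates.append((int_a * int_b, i, j))
    candidates.sort(reverse=True)

    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product

    norm_a = math.sqrt(sum(inten ** 2 for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1 means the two spectra share most of their fragment peaks, and a score of 0 means no peaks matched within the tolerance. In a network, two spectra are linked when their score passes a chosen threshold.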
5.1.4 OpenMS & SIRIUS
OpenMS is another good platform for mass spectrometry data analysis, developed in C++. You could use its tools as plugins in KNIME. I suggest that anyone who wants to be a data scientist get familiar with a platform like KNIME, because it supplies APIs for different programming languages, is easy to use and shows every step to others. TOPPView in OpenMS could also be the best software to visualize MS data. You could use the metabolomics workflow to train beginners in the details of data processing. pyOpenMS and OpenSWATH are also part of this platform. If you want to move into industry, this platform fits you best because it gives you a clear idea of solutions and workflows.
Check these papers for OpenMS based workflow (Bertsch et al. 2011; Pfeuffer et al. 2017; Röst et al. 2014, 2016; Rurik et al. 2020; Alka et al. 2020a).
OpenMS can be coupled to SIRIUS 4 for annotation. SIRIUS is a Java-based software framework for de novo identification of metabolites using single and tandem mass spectrometry. The SIRIUS 4 project integrates a collection of tools, including CSI:FingerID, ZODIAC and CANOPUS. Check these papers for SIRIUS based workflow (Dührkop et al. 2019, 2020; Alka et al. 2020b; Ludwig et al. 2020).
5.1.5 MZmine 2
MZmine 2 has three versions developed on the Java platform, and the latest version is included in MSDK. MZmine 2 offers functions similar to those of XCMS Online. However, MZmine 2 does not have pathway analysis; you could use MetaboAnalyst for that purpose. You could also look into MSDK to find similar functions supplied by ProteoSuite and OpenChrom. If you are an experienced Java coder, you should start here.
Check these papers for MZmine based workflow (Pluskal et al. 2010; Pluskal et al. 2020).
5.1.6 Emory MaHPIC
This platform is composed of several R packages from Emory University, including apLCMS to collect the data, xMSanalyzer to handle the automated pipeline for large-scale, non-targeted metabolomics data, xMSannotator for annotation of LC-MS data and Mummichog for pathway and network analysis of high-throughput metabolomics. This platform would be preferred by people in environmental science who study the exposome.
You could check these papers for the Emory workflow (Uppal et al. 2013; Uppal, Walker, and Jones 2017; T. Yu et al. 2009; S. Li et al. 2013; Q. Liu et al. 2020).
5.1.7 Others
MAVEN from Princeton University (Melamud, Vastag, and Rabinowitz 2010; Clasquin, Melamud, and Rabinowitz 2012).
metabolomics is a CRAN package for analysis of metabolomics data.
autoGCMSDataAnal is a Matlab based comprehensive data analysis strategy for GC-MS-based untargeted metabolomics, and AntDAS2 provides an automatic data analysis strategy for UPLC-HRMS-based metabolomics (Y.-J. Yu et al. 2019; Y.-Y. Zhang et al. 2020).
enviGCMS is for environmental non-targeted analysis and rmwf is for reproducible metabolomics workflow (M. Yu and Petrick 2020; M. Yu, Olkowicz, and Pawliszyn 2019a).
Pseudotargeted metabolomics method (Zheng et al. 2020; Y. Wang et al. 2016).
pySM provides a reference implementation of a pipeline for False Discovery Rate-controlled metabolite annotation of high-resolution imaging mass spectrometry data (Palmer et al. 2017).
TinyMS is a Python-based pipeline for preprocessing LC–MS data for untargeted metabolomics workflows (Riquelme et al. 2020).
MetaboliteDetector is a QT4 based software package for the analysis of GC/MS based metabolomics data (Hiller et al. 2009).
W4M and metaX can analyze data online (Giacomoni et al. 2015; Wen et al. 2017; Jalili et al. 2020).
FTMSVisualization is a suite of tools for visualizing complex mixture FT-MS data (Kew et al. 2017).
magma can predict and match MS/MS files.
5.1.8 Workflow Comparison
Here are some comparisons of different workflows; you could make your selection based on these works (Myers et al. 2017; Weber et al. 2017; Z. Li et al. 2018).
5.2 Project Setup
I suggest building your data analysis projects in RStudio (Click File - New Project - New Directory - Empty Project). Then assign a name to your project. I also recommend the following tips if you are familiar with them.
Use git/GitHub for version control of your code and to sync your project online.
Don't use your own name for your project, because other people might collaborate with you and someone might check your data when you publish your papers. Each project should cover the work for one paper or one chapter in your thesis.
Use a workflow document (txt or doc) in your project to record all of the steps and code you performed for this project. Treat this document as the digital version of your experiment notebook.
Use a data folder in your project folder for the raw data and the results you get from data analysis.
Use a figure folder in your project folder for the figures.
Use a manuscript folder in your project folder for the manuscript (you could write the paper in RStudio with the help of an R Markdown template).
Just double click [yourprojectname].Rproj to start your project.
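The folder layout recommended above can also be created by a short script. The following is a minimal sketch, not part of RStudio: the function name `create_project` and the file name `workflow.txt` are assumptions made for this example, and the script neither creates the .Rproj file (RStudio does that) nor initializes git.

```python
from pathlib import Path


def create_project(root, folders=("data", "figure", "manuscript")):
    """Create the suggested project skeleton under `root`.

    Makes one folder each for data, figures and the manuscript, plus an
    empty workflow document that serves as the digital notebook.
    Existing folders and an existing workflow document are left untouched.
    """
    root = Path(root)
    for name in folders:
        (root / name).mkdir(parents=True, exist_ok=True)

    notebook = root / "workflow.txt"  # hypothetical file name
    if not notebook.exists():
        notebook.write_text("Workflow notes\n", encoding="utf-8")

    # Return the top-level entries so the caller can check the layout.
    return sorted(p.name for p in root.iterdir())
```

Calling `create_project("my-analysis")` on a new path creates the three folders and the workflow document. The function is safe to run again on an existing project because nothing is overwritten.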
5.3 Data sharing
See this paper (Haug, Salek, and Steinbeck 2017):
MetaboLights, EU based
The Metabolomics Workbench, US based
MetabolomeXchange, a search engine
MetabolomeExpress, a public place to process, interpret and share GC/MS metabolomics datasets (Carroll, Badger, and Harvey Millar 2010).
5.4 Contest
- CASMI, the Critical Assessment of Small Molecule Identification contest (Blaženović et al. 2017)