Chapter 5 Workflow

For a book-length treatment of metabolomics data analysis, see S. Li (2020).

5.1 Platform for metabolomics data analysis

Here is a list of related open source projects.

5.1.1 XCMS & XCMS online

XCMS online is hosted by the Scripps Research Institute. If your datasets are not large, XCMS online would be the best option for you. It was recently updated to support more functions for systems biology. It uses METLIN and isoMETLIN to annotate MS/MS data, and pathway analysis is also supported. To accelerate processing, XCMS online also provides stream (Windows only): you can use stream to connect your instrument workstation to their server and process the data automatically while it is being acquired. They also developed apps for XCMS online, but I think apps for Slack would be even cooler for controlling the data processing.

xcms is different from XCMS online, although they might share the same code. I use it almost every day to run metabolomics data analysis locally. The developers are moving to xcms 3, a major update to the object classes: the data format is integrated with the MSnbase package, and the parameters are easy to set up for each step. Normally, I use msconvert-IPO-xcms-xMSannotator-metaboanalyst as the workflow to process data offline, and parallel processing can accelerate it. However, if you are not familiar with R, you had better choose one of the software options below.

IPO is a tool for automated optimization of XCMS parameters (Libiseller et al. 2015), and Warpgroup is used for chromatogram subregion detection, consensus integration bound determination and accurate missing value integration (Nathaniel G. Mahieu, Genenbacher, and Patti 2016a). Another option is AutoTuner, which is much faster than IPO (McLean and Kujawinski 2020). More recently, MetaboAnalystR 3.0 can also optimize the parameters for xcms, although you then need to perform the subsequent analysis within this software (Pang et al. 2020).

Check these papers for the XCMS based workflow (Forsberg et al. 2018; Huan et al. 2017; Nathaniel G. Mahieu, Genenbacher, and Patti 2016b; Montenegro-Burke et al. 2017; Domingo-Almenara and Siuzdak 2020). For METLIN related annotation, check these papers (Guijas et al. 2018; Tautenhahn et al. 2012; Xue, Guijas, et al. 2020; Domingo-Almenara, Montenegro-Burke, Ivanisevic, et al. 2018).

MAIT is based on xcms, and its source code is available online (Fernández-Albert et al. 2014).

iMet-Q is an automated tool with a friendly user interface for quantifying metabolites in full-scan liquid chromatography-mass spectrometry (LC-MS) data (Chang et al. 2016).

compMS2Miner is an automatable metabolite identification, visualization, and data-sharing R package for high-resolution LC–MS data sets. Here are the related papers (Edmands et al. 2017; Edmands, Hayes, and Rappaport 2018; Edmands, Barupal, and Scalbert 2015).

mzMatch is a modular, open source and platform independent data processing pipeline for metabolomics LC/MS data, written in Java, which could be coupled with xcms (Scheltema et al. 2011a; Creek et al. 2012). It can also be used for annotation with MetAssign (Daly et al. 2014a).

5.1.2 PRIMe

PRIMe is from RIKEN and UC Davis. They update their database frequently (Tsugawa et al. 2016). It supports mzML and major MS vendor formats. They defined their own file format, ABF, and an ecosystem for omics studies. The software is updated almost every day. You could use MS-DIAL for untargeted analysis and MRMPROBS for targeted analysis. For annotation, they developed MS-FINDER, and they also provide statistical tools that work with Excel. This platform could replace expensive commercial software and is well prepared for MS/MS data analysis and lipidomics. The tools are open source, work on Windows and could also run within Mathematica. However, they don’t cover pathway analysis. Another feature is that they always show the most recent spectral records from public repositories, so you can always get updated MSP spectra files for your own data analysis.

For the PRIMe based workflow, check these papers (Lai et al. 2018a; Matsuo et al. 2017; Treutler et al. 2016a; Tsugawa et al. 2015a; Tsugawa et al. 2016; Kind et al. 2018).

5.1.3 GNPS

GNPS is an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass (MS/MS) spectrometry data. It offers a straightforward annotation method for MS/MS data. Feature-based molecular networking (FBMN) within GNPS could be coupled with xcms, OpenMS, MS-DIAL, MZmine 2, and other popular software.

Check these papers for GNPS and related projects (Aron et al. 2020; Nothias et al. 2020; Scheubert et al. 2017; R. R. da Silva et al. 2018; M. Wang et al. 2016).

5.1.4 OpenMS & SIRIUS

OpenMS is another good platform for mass spectrometry data analysis, developed in C++. You could use its tools as plugins in KNIME. I suggest that anyone who wants to be a data scientist get familiar with platforms like KNIME, because they supply APIs for different programming languages, are easy to use and show every step to others. TOPPView in OpenMS could also be the best software for visualizing MS data. You could always use the metabolomics workflow to train beginners in the details of data processing. pyOpenMS and OpenSWATH are also part of this platform. If you want to move into industry, this platform fits you best because it gives you a clear idea of solutions and workflows.

Check these papers for the OpenMS based workflow (Bertsch et al. 2011; Pfeuffer et al. 2017; Röst et al. 2014, 2016; Rurik et al. 2020; Alka et al. 2020a).

OpenMS could be coupled to SIRIUS 4 for annotation. SIRIUS is a Java-based software framework for de novo identification of metabolites using single and tandem mass spectrometry. The SIRIUS 4 project integrates a collection of tools, including CSI:FingerID, ZODIAC and CANOPUS. Check these papers for the SIRIUS based workflow (Dührkop et al. 2019, 2020; Alka et al. 2020b; Ludwig et al. 2020).

5.1.5 MZmine 2

MZmine 2 has three versions developed on the Java platform, and the latest version is included in MSDK. MZmine 2 offers functions similar to those in XCMS online. However, MZmine 2 does not have pathway analysis; you could use metaboanalyst for that purpose. You could also look into MSDK to find similar functions supplied by ProteoSuite and OpenChrom. If you are an experienced Java coder, you should start here.

Check these papers for the MZmine based workflow (Pluskal et al. 2010; Pluskal et al. 2020).

5.1.6 Emory MaHPIC

This platform is composed of several R packages from Emory University, including apLCMS to process the raw LC-MS data, xMSanalyzer to handle the automated pipeline for large-scale, non-targeted metabolomics data, xMSannotator for annotation of LC-MS data and Mummichog for pathway and network analysis of high-throughput metabolomics data. This platform would be preferred by people in environmental science who study the exposome.

You could check these papers for the Emory workflow (Uppal et al. 2013; Uppal, Walker, and Jones 2017; T. Yu et al. 2009; S. Li et al. 2013; Q. Liu et al. 2020).

5.1.7 Others

5.1.8 Workflow Comparison

Here are some comparisons of different workflows, and you could make your selection based on their results (Myers et al. 2017; Weber et al. 2017; Z. Li et al. 2018).

5.2 Project Setup

I suggest building your data analysis projects in RStudio (click File - New Project - New Directory - Empty Project), then assigning a name to your project. If you are familiar with RStudio, I also recommend the following tips.

  • Use git/GitHub for version control of your code and to sync your project online.

  • Don’t use your name for your project, because other people might cooperate with you and someone might check your data when you publish your papers. Each project should cover the work for one paper or one chapter in your thesis.

  • Use a workflow document (txt or doc) in your project to record all of the steps and code you performed for this project. Treat this document as the digital version of your experiment notebook.

  • Use a data folder in your project folder for the raw data and the results you get from data analysis.

  • Use a figure folder in your project folder for the figures.

  • Use a manuscript folder in your project folder for the manuscript (you could write the paper in RStudio with the help of templates in R Markdown).

  • Just double click [yourprojectname].Rproj to start your project
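The folder layout described in the list above could also be created with a short script instead of by hand. The following is only a minimal sketch in Python, chosen for illustration; the function name `create_project_skeleton`, the file name `workflow.txt` and its placeholder first line are assumptions and not part of any tool mentioned in this chapter. It does not create the .Rproj file, which RStudio writes itself when you create the project.

```python
from pathlib import Path

# Subfolders named in the list above.
SUBFOLDERS = ("data", "figure", "manuscript")
# Hypothetical name for the workflow document (the text only says "txt or doc").
WORKFLOW_FILE = "workflow.txt"


def create_project_skeleton(root):
    """Create the project folder, its subfolders and a workflow notes file.

    Returns the list of subfolder paths and the path of the workflow file.
    Existing folders and an existing workflow file are left untouched.
    """
    root = Path(root)
    folders = []
    for name in SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)

    workflow = root / WORKFLOW_FILE
    if not workflow.exists():
        # Placeholder first line; replace it with your own notes.
        workflow.write_text("Workflow notes\n", encoding="utf-8")
    return folders, workflow
```

Calling `create_project_skeleton("my-paper-project")` (a hypothetical project name) would create the three folders and the notes file under that directory. Running it again is harmless, because existing folders and notes are not overwritten.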

5.3 Data sharing

See this paper (Haug, Salek, and Steinbeck 2017).

5.4 Contest