Selection of open source software platform for metabolomics or lipidomics - Miao Yu

Recently I reviewed some platforms for metabolomics or lipidomics and tried to find out which one is currently available to my demands. I knew some instrumental company supplied software or solutions for metabolomics and lipidomics. I just disliked them since they concealed too many details about data processing which made omics studies as a magic box like Artificial Neural Network. It gains nothing for scientist to get insights from the data and you might only follow the workflow defined by the company. I read some papers using those kind of software and felt the authors know little about what they performed. Data analysis for metabolomics or lipidomics is a systems engineering. If someone really want to use this tools, just choose a open source platform and inspect the code when you needed. Otherwise, your work could be replaced by AI someday. It’s not a joke. Here is a summary for the selection of mass spectrum based metabolomics or lipidomics software platform.

Principles for selection

A decent solution should have functions covering the following required functions or features:

Open source: as I said before, researcher could benefit a lot from open source community. Based on your experiences, Java, matlab, python and R would be your choices. I liked R most while python might dominate everything in the further.
Data input/output: this work could be done by msconvert from proteowizard. I preferred to install msconvert directly on the instruments and convert the vendor files into mzML or mzXML files to perform further analysis. I just dislike Windows. On the other hand, some software from the instruments could output CDF files, which is also nice.
Turn the Raw data into mass-retention time matrix: single data file from mass spectrum would always contain matadata and profile data. The former container has records for that data and the latter were always full scans of mass spectrum indexed by retention time. Then you need to map such data into a matrix because the following peaks picking and alignments always based on algorithms for matrix.
Peaks picking: you also need functions to pick up the peaks from mass-retention time matrix. Functions like centWave or isotope-based algorithms would help to find the peaks. Sometimes you need to bin the matrix first before peak picking and such requirements should be carefully checked. Anyway you could just use one function to perform peaks picking in most software. The output of this step would be a peak list with intensity, mass and retention time.
Retention time correction: when you process more than one data file such as technique replicates, the drift of retention time would make the final peak list contained replicated peaks. Someone use certain peaks to corrected the data, while I prefer some global similarity analysis based algorithm to undertake this task. However, some software might add one step to group the peaks before and/or after retention time correction. Anyway, the output of this step is the peak list from one group.
Batch effects correction: this is another issue from analysis procedure and batch effects would also bury signals from noise. Researchers from analytical chemistry always use random experiment design and pooled sample to check if the batch effects exist. However, I think the most important things were correction. Just treat a series of samples suffered batch effects and use some statistical analysis to correct them.
Statistical analysis: when you get the peaks list, you could perform the following supervised or non-supervised analysis to find bio-markers or data pattern. I think Metaboanalyst could satisfy most researchers.
Annotation or identification: the peaks annotation is also important. HMDB and LIPID MAPS would be the basic version for mapping peaks to compounds. In most cases, MS/MS data would be used. I think most of MS/MS dataset could be separated handled by GNPS or Metlin and predicted by FingerID.
Pathway analysis: sometimes we also need to know the relationship among compounds and pathway analysis is always performed for that purpose.

Platform

XCMS online

XCMS online is hosted by Scripps Institute. If your datasets are not large, XCMS online would be the best option for you. Recently they updated the online version to support more functions for systems biology. They use metlin and iso metlin to annotate the MS/MS data. Pathway analysis is also supported. Besides, to accelerate the process, xcms online employed stream (windows only). You could use stream to connect your instrument workstation to their server and process the data along with the data acquisition automate. They also developed apps for xcms online, but I think apps for slack would be even cooler to control the data processing.

PRIMe

PRIMe is from RIKEN and UC Davis. It supports mzML and major MS vendor formats. They defined own file format ABF and eco-system for omics studies. The software are updated almost everyday. You could use MS-DIAL for untargeted analysis and MRMOROBS for targeted analysis. For annotation, they developed MS-FINDER and they also developed statistic tools with excel. This platform could replaced the dear software from company and well prepared for MS/MS data analysis and lipidomics. They are open source, work on Windows and also could run within mathmamtics. However, they don’t cover pathway analysis. Another feature is they always show the most recently spectral records from public repositories. You could always get the updated MSP spectra files for your own data analysis.

OpenMS

OpenMS is another good platform for mass spectrum data analysis developed with C++. You could use them as plugin of KNIME. I suggest anyone who want to be a data scientist to get familiar with platform like KNIME because they supplied various API for different programme language, which is easy to use and show every steps for others. Also TOPPView in OpenMS could be the best software to visualize the MS data. You could always use the metabolomics workflow to train starter about details in data processing. pyOpenMS and OpenSWATH are also used in this platform. If you want to turn into industry, this platform fit you best because you might get a clear idea about solution and workflow.

MZmine 2

MZmine 2 has three version developed on Java platform and the lastest version is included into MSDK. Similar function could be found from MZmine 2 as shown in XCMS online. However, MZmine 2 do not have pathway analysis. You could use metaboanalyst for that purpose. Actually, you could go into MSDK to find similar function supplied by ProteoSuite and Openchrom. If you are a experienced coder for Java, you should start here.

XCMS

xcms is different from xcms online while they might share the same code. I used it almost every data to run local metabolomics data analysis. Recently, they will change their version to xcms 3 with major update for object class. Their data format would integrate into the MSnbase package and the parameters would be easy to set up for each step. Normally, I will use msconvert-IPO-xcms-xMSannotator-metaboanalyst as workflow to process the offline data. It could accelerate the process by parallel processing. However, if you are not familiar with R, you would better to choose some software above.

Emory MaHPIC

This platform is composed by several R packages from Emory University including apLCMS to collect the data, xMSanalyzer to handle automated pipeline for large-scale, non-targeted metabolomics data, xMSannotator for annotation of LC-MS data and Mummichog for pathway and network analysis for high-throughput metabolomics. This platform would be preferred by someone from environmental science to study exposome. I always use xMSannotator to annotate the LC-MS data.

Others

MAVEN from Princeton University
RAMclustR from Colorado State University
MAIT based on xcms
enviGCMS from me
Metabolights for sharing data

Selection

All of the platform above could be used for metabolomics and lipidomics. However, best selection would be based on your programming skills and the popularity in your research area. Every tool need training before analysis your data and you could choose one randomly and be focused on the source code for one day. If you feel you could handle it, just use it. Otherwise select another one. Enjoy and fight with the data.