Chapter 7 Annotation
Once you have a peak table or feature table, annotating the peaks helps you interpret them. See this review (Domingo-Almenara, Montenegro-Burke, Benton, et al. 2018a) or other reviews (Chaleckis et al. 2019; Lai et al. 2018b; Nash and Dunn 2019; Mark R. Viant et al. 2017; Allard, Genta-Jouve, and Wolfender 2017; Domingo-Almenara, Montenegro-Burke, Benton, et al. 2018b) for detailed notes on annotation. The first paper proposed five levels for current computational annotation strategies.
- Level 1: Peak grouping: MS pseudospectra extraction based on peak shape similarity and peak abundance correlation
- Level 2: Peak annotation: adducts, neutral losses, isotopes, and other mass relationships based on mass distances
- Level 3: Biochemical knowledge based on putative identification, potential biochemical reactions and related statistical analysis
- Level 4: Use and integration of tandem MS data based on data-dependent/independent acquisition mode or in silico prediction
- Level 5: Retention time prediction based on library-available retention indices or quantitative structure-retention relationship (QSRR) models
Most software works at level 1 or 2. If we only have compound structures, we can predict the ions formed under different ionization methods. If we have a mass spectrum, we can match it against a database by similarity analysis. In metabolomics, we only have mass spectra or mass-to-charge ratios. A single mass-to-charge ratio is not enough for identification, which is one bottleneck for annotation. Structure prediction is therefore usually performed on MS/MS data.
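As a minimal sketch of what "similarity analysis" can look like, the snippet below computes a cosine score between a query spectrum and a reference spectrum. The cosine score is only one common choice and is an assumption here, as are the function name `cosine_similarity`, the greedy peak matching and the 0.01 Da tolerance.

```python
import math

def cosine_similarity(query, reference, tol=0.01):
    """Cosine score between two spectra given as lists of (m/z, intensity).

    Peaks are paired greedily: each query peak takes the closest unused
    reference peak within `tol` Da (a hypothetical default).
    """
    used = set()
    dot = 0.0
    for mz_q, int_q in query:
        best_diff, best_j = None, None
        for j, (mz_r, _) in enumerate(reference):
            diff = abs(mz_q - mz_r)
            if j in used or diff > tol:
                continue
            if best_diff is None or diff < best_diff:
                best_diff, best_j = diff, j
        if best_j is not None:
            used.add(best_j)
            dot += int_q * reference[best_j][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_r = math.sqrt(sum(i * i for _, i in reference))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)

# Made-up spectra for illustration only
query = [(100.0, 10.0), (150.0, 5.0)]
library_entry = [(100.002, 10.0), (150.001, 5.0)]
score = cosine_similarity(query, library_entry)
```

A score close to 1 indicates that the matched peaks have similar relative intensities; real tools usually add intensity weighting and precursor filtering on top of this.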
7.1 Issues in annotation
The major issue in annotation is redundant peaks from the same metabolite. Unlike genes in genomic data, the peaks or features from peak picking are not independent of each other. Adducts, in-source fragments and isotopes can lead to wrong annotations. A common solution is to compare the mass distances between peaks against those expected for known adducts, neutral losses, molecular multimers or multiply charged ions.
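The mass-distance comparison can be sketched as follows. The distances are computed from standard monoisotopic masses and rounded to five decimals; the function name `annotate_pairs`, the short list of relationships and the tolerances are assumptions chosen for illustration, not the settings of any particular tool.

```python
# Mass distances (Da) relative to [M+H]+ in positive mode, rounded
KNOWN_DISTANCES = {
    "13C isotope": 1.00335,
    "[M+NH4]+ vs [M+H]+": 17.02655,
    "[M+Na]+ vs [M+H]+": 21.98194,
    "[M+K]+ vs [M+H]+": 37.95588,
}

def annotate_pairs(peaks, mz_tol=0.005, rt_tol=2.0):
    """Label co-eluting peak pairs whose m/z distance matches a known one.

    peaks: list of (mz, retention_time_in_seconds).
    Returns a list of (index_low, index_high, label).
    """
    hits = []
    for i, (mz_i, rt_i) in enumerate(peaks):
        for j, (mz_j, rt_j) in enumerate(peaks):
            if i == j or abs(rt_i - rt_j) > rt_tol:
                continue  # only co-eluting peaks can come from one compound
            delta = mz_j - mz_i
            for label, distance in KNOWN_DISTANCES.items():
                if abs(delta - distance) <= mz_tol:
                    hits.append((i, j, label))
    return hits

# Glucose-like example: [M+H]+, its 13C isotope, [M+Na]+, and an unrelated peak
peaks = [(181.0707, 60.0), (182.0740, 60.2), (203.0526, 60.1), (250.1000, 300.0)]
hits = annotate_pairs(peaks)
```

Peaks linked this way can be collapsed into one group before database search, so that only one ion per metabolite is queried.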
Another issue concerns MS/MS databases. Only 10% of known metabolites in databases have experimental spectral data, so in silico prediction is required. Some works try to bridge the gap between experimental data, theoretical values (from chemical databases such as ChemSpider) and predictions. Here is a nice review about MS/MS prediction (Hufsky, Scheubert, and Böcker 2014).
7.2 Peak misidentification
Use separation methods such as chromatography, ion mobility MS, or MS/MS. Reversed-phase ion-pairing chromatography and HILIC are useful. Chemical derivatization is another option.
- Interfering compounds
For HRMS, 20 ppm is the loosest acceptable mass accuracy.
- In-source degradation products
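The 20 ppm figure above can be turned into a simple check. This is a small sketch; the function names and the default tolerance argument are illustrative assumptions.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6

def within_tolerance(measured_mz, theoretical_mz, tol_ppm=20.0):
    """True if the measured m/z lies within tol_ppm of the theoretical m/z."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm

def mass_window(theoretical_mz, tol_ppm=20.0):
    """Lower and upper m/z bounds for a given ppm tolerance."""
    half_width = theoretical_mz * tol_ppm / 1e6
    return theoretical_mz - half_width, theoretical_mz + half_width

low, high = mass_window(500.0)  # a 20 ppm window at m/z 500 is about +/- 0.01 Da
```

Because the window scales with m/z, the same ppm tolerance admits more interfering candidates at higher masses.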
7.3 Annotation vs. identification
The Chemical Analysis Working Group of the Metabolomics Standards Initiative (Lloyd W. Sumner et al. 2007; Mark R. Viant et al. 2017) defined four levels of confidence that can be assigned to an identification:
- Level 1 ‘identified metabolites’
- Level 2 ‘Putatively annotated compounds’
- Level 3 ‘Putatively characterised compound classes’
- Level 4 ‘Unknown’
In practice, annotation based on data analysis can reach level 2. For level 1, we need extra methods such as MS/MS, retention time, accurate mass, 2D NMR spectra, and so on to confirm the compounds. However, standards are always required for solid proof.
Though MS/MS seems to be a required step for identification, a recent study found that ESI might also generate fragment ions usable for structure identification (Xue, Domingo-Almenara, et al. 2020).
7.4 Molecular Formula Assignment
Cheminformatics helps with MS annotation. The first task is molecular formula assignment. For a given accurate mass, candidate formulas should be constrained by predefined element types and atom counts, a mass error window, and rules of chemical bonding such as double bond equivalents (DBE) and the nitrogen rule. The nitrogen rule states that an odd nominal molecular mass implies an odd number of nitrogen atoms. This rule should only be used with nominal (integer) masses. The degree of unsaturation, or DBE, is expressed as the rings-plus-double-bonds equivalent (RDBE), which should be an integer; otherwise the molecular formula is not valid. Oxygen and sulphur are not taken into account in the calculation.
\[RDBE = C+Si - 1/2(H+F+Cl+Br+I) + 1/2(N+P)+1 \]
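A minimal sketch of both checks, assuming neutral formulas written in plain text such as `C6H12O6`. The helper names and the nominal mass table are illustrative, and the formula parser is deliberately simple (no brackets, charges or isotope labels).

```python
import re

# Nominal (integer) masses of the most abundant isotopes
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16, "P": 31, "S": 32,
                "F": 19, "Cl": 35, "Br": 79, "I": 127, "Si": 28}

def parse_formula(formula):
    """Turn 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def rdbe(formula):
    """RDBE = C + Si - (H + F + Cl + Br + I)/2 + (N + P)/2 + 1."""
    n = parse_formula(formula)
    total = lambda *elements: sum(n.get(e, 0) for e in elements)
    return (total("C", "Si")
            - 0.5 * total("H", "F", "Cl", "Br", "I")
            + 0.5 * total("N", "P")
            + 1)

def passes_nitrogen_rule(formula):
    """Odd nominal mass must go with an odd nitrogen count, and vice versa."""
    n = parse_formula(formula)
    nominal = sum(NOMINAL_MASS[e] * k for e, k in n.items())
    return nominal % 2 == n.get("N", 0) % 2

benzene_rdbe = rdbe("C6H6")     # one ring plus three double bonds
pyridine_rdbe = rdbe("C5H5N")
```

A candidate such as `C6H6N` fails both tests: its RDBE is not an integer and its even nominal mass conflicts with its single nitrogen.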
To assign a molecular formula to a mass-to-charge ratio, the Seven Golden Rules (Kind and Fiehn 2007) for heuristic filtering of molecular formulas should be considered:
- Apply heuristic restrictions for number of elements during formula generation. This is the table for known compounds:
##   Mass.Range.[Da] Library C.max H.max N.max O.max P.max S.max F.max Cl.max Br.max Si.max
## 1           < 500     DNP    29    72    10    18     4     7    15      8      5     NA
## 2            <NA>   Wiley    39    72    20    20     9    10    16     10      4      8
## 3          < 1000     DNP    66   126    25    27     6     8    16     11      8     NA
## 4            <NA>   Wiley    78   126    20    27     9    14    34     12      8     14
## 5          < 2000     DNP   115   236    32    63     6     8    16     11      8     NA
## 6            <NA>   Wiley   156   180    20    40     9    14    48     12     10     15
## 7          < 3000     DNP   162   208    48    78     6     9    16     11      8     NA
- Perform LEWIS and SENIOR check. The LEWIS rule demands that molecules consisting of main group elements, especially carbon, nitrogen and oxygen, share electrons in such a way that all atoms have completely filled s, p-valence shells ('octet rule'). Senior's theorem requires three essential conditions for the existence of molecular graphs:
  - The sum of valences or the total number of atoms having odd valences is even;
  - The sum of valences is greater than or equal to twice the maximum valence;
  - The sum of valences is greater than or equal to twice the number of atoms minus 1, that is, 2(n - 1) for n atoms.
- Perform isotopic pattern filter. Isotope ratio abundance was included in the algorithm as an additional orthogonal constraint, assuming high quality data acquisitions, specifically sufficient ion statistics and a high signal/noise ratio for the detection of the M+1 and M+2 abundances. For monoisotopic elements (F, Na, P, I) this rule has no impact. The isotope pattern is useful for brominated and chlorinated small molecules and for sulphur-containing peptides.
- Perform H/C ratio check (hydrogen/carbon ratio). In most cases the hydrogen/carbon ratio does not exceed 3, with rare exceptions such as methylhydrazine (CH6N2). Conversely, the H/C ratio is usually smaller than 2 and should not be less than 0.125, as in the case of tetracyanopyrrole (C8HN5).
- Perform NOPS ratio check (N, O, P, S/C ratios).
##   Element.ratios Common.range.(covering.99.7%) Extended.range.(covering.99.99%) Extreme.range.(beyond.99.99%)
## 1            H/C                       0.2–3.1                            0.1–6                 < 0.1 and 6–9
## 2            F/C                         0–1.5                              0–6                         > 1.5
## 3           Cl/C                         0–0.8                              0–2                         > 0.8
## 4           Br/C                         0–0.8                              0–2                         > 0.8
## 5            N/C                         0–1.3                              0–4                         > 1.3
## 6            O/C                         0–1.2                              0–3                         > 1.2
## 7            P/C                         0–0.3                              0–2                         > 0.3
## 8            S/C                         0–0.8                              0–3                         > 0.8
## 9           Si/C                         0–0.5                              0–1                         > 0.5
- Perform heuristic HNOPS probability check (H, N, O, P, S/C high probability ratios)
# Heuristic HNOPS probability rules and database examples for the maximum values
df <- data.frame(
  stringsAsFactors = FALSE,
  Element.counts = c("NOPS all > 1", "NOP all > 3", "OPS all > 1",
                     "PSN all > 1", "NOS all > 6"),
  Heuristic.Rule = c("N< 10, O < 20, P < 4, S < 3",
                     "N < 11, O < 22, P < 6", "O < 14, P < 3, S < 3",
                     "P < 3, S < 3, N < 4", "N < 19 O < 14 S < 8"),
  DB.examples.for.maximum.values = c("C15H34N9O8PS, C22H44N4O14P2S2, C24H38N7O19P3S",
                                     "C20H28N10O21P4, C10H18N5O20P5",
                                     "C22H44N4O14P2S2, C16H36N4O4P2S2",
                                     "C22H44N4O14P2S2, C16H36N4O4P2S2",
                                     "C59H64N18O14S7")
)
# Print the table
df
##   Element.counts              Heuristic.Rule                DB.examples.for.maximum.values
## 1   NOPS all > 1 N< 10, O < 20, P < 4, S < 3 C15H34N9O8PS, C22H44N4O14P2S2, C24H38N7O19P3S
## 2    NOP all > 3       N < 11, O < 22, P < 6                 C20H28N10O21P4, C10H18N5O20P5
## 3    OPS all > 1        O < 14, P < 3, S < 3               C22H44N4O14P2S2, C16H36N4O4P2S2
## 4    PSN all > 1         P < 3, S < 3, N < 4               C22H44N4O14P2S2, C16H36N4O4P2S2
## 5    NOS all > 6         N < 19 O < 14 S < 8                                C59H64N18O14S7
- Perform TMS check (for GC-MS if a silylation step is involved). For TMS derivatized molecules detected in GC/MS analyses, the rules on element ratio checks and valence tests are hence best applied after TMS groups are subtracted, in a similar manner as adducts need to be first recognized and subtracted in LC/MS analyses.
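Two of the rules above, the element ratio checks and the SENIOR check, can be sketched in a few lines. The ratio limits are the "common range" values from the element ratio table; the valence table is an assumption (lowest common valence states, so P(V) or S(VI) compounds would need a different table), and the function names are illustrative.

```python
import re

HC_COMMON_RANGE = (0.2, 3.1)
# Upper limits of element/carbon ratios, common range (covering 99.7%)
MAX_RATIO_TO_CARBON = {"F": 1.5, "Cl": 0.8, "Br": 0.8, "N": 1.3,
                       "O": 1.2, "P": 0.3, "S": 0.8, "Si": 0.5}
# Assumed valences (lowest common valence state of each element)
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
           "F": 1, "Cl": 1, "Br": 1, "I": 1, "Si": 4}

def parse_formula(formula):
    """Turn 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def passes_ratio_checks(formula):
    """H/C and heteroatom/C ratios must fall in the common ranges."""
    n = parse_formula(formula)
    carbon = n.get("C", 0)
    if carbon == 0:
        return False  # the ratios are undefined without carbon
    h_to_c = n.get("H", 0) / carbon
    if not HC_COMMON_RANGE[0] <= h_to_c <= HC_COMMON_RANGE[1]:
        return False
    return all(n.get(element, 0) / carbon <= limit
               for element, limit in MAX_RATIO_TO_CARBON.items())

def passes_senior(formula):
    """Senior's three conditions on the sum of valences."""
    n = parse_formula(formula)
    valences = [VALENCE[element] for element, k in n.items() for _ in range(k)]
    total = sum(valences)
    return (total % 2 == 0                       # (i) even sum of valences
            and total >= 2 * max(valences)       # (ii) at least twice the max valence
            and total >= 2 * (len(valences) - 1))  # (iii) at least 2(n - 1)

glucose_ok = passes_ratio_checks("C6H12O6") and passes_senior("C6H12O6")
```

Filters like these are cheap, so they are usually applied before the more expensive isotopic pattern comparison.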
The Seven Golden Rules were built for GC-MS, while the Hydrogen Rearrangement Rules were mainly designed for LC-CID-MS/MS (Tsugawa et al. 2016). Based on extensively curated database records and enthalpy calculations, the "hydrogen rearrangement (HR) rules" extend the even-electron rule for carbon (C) and the heteroatoms oxygen (O), nitrogen (N), phosphorus (P), and sulfur (S). The authors used high-abundance MS/MS peaks exceeding 10% of the base peak to identify common features, expressed as 4 HR rules for positive mode and 5 HR rules for negative mode.
The Seven Golden Rules and Hydrogen Rearrangement Rules might also be captured by statistical models. Nevertheless, such heuristic rules reduce the search space of possible formulas.
molgen generates all structures (connectivity isomers, constitutions) that correspond to a given molecular formula, with optional further restrictions, e.g. presence or absence of particular substructures (Gugisch et al. 2015).
RAMSI is a robust automated mass spectra interpretation and chemical formula calculation method using mixed integer linear programming optimization (Baran and Northen 2013).
Here are some other cheminformatics tools that can be used to assign meaningful formulas or structures to mass spectra.
- RDKit Open-Source Cheminformatics Software
- cdk The Chemistry Development Kit (CDK) is a scientific, LGPL-ed library for bio- and cheminformatics and computational chemistry written in Java (Guha 2007).
- Open Babel Open Babel is a chemical toolbox designed to speak the many languages of chemical data (O’Boyle et al. 2011).
- ClassyFire is a tool for automated chemical classification with a comprehensive, computable taxonomy (Djoumbou Feunang et al. 2016).
7.5 Redundant peaks
Full scan mass spectra always contain many redundant peaks such as adducts, isotopes, fragments, multiply charged ions and oligomers. Such peaks dominate the feature table (Xu, Lu, and Rabinowitz 2015; Sindelar and Patti 2020; Nathaniel G. Mahieu and Patti 2017). Annotation tools can label those peaks either from a known list or by frequency analysis of the paired mass distances (Ju et al. 2020; Kouřil et al., n.d.).
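A rough sketch of the frequency analysis idea: compute all pairwise m/z distances, round them, and keep the distances that recur. The function name, the rounding to two decimals and the `min_count` threshold are assumptions for illustration; published tools additionally restrict pairs to co-eluting peaks.

```python
from collections import Counter
from itertools import combinations

def frequent_mass_distances(mzs, digits=2, min_count=2):
    """Return (distance, count) pairs for recurring pairwise m/z distances."""
    counts = Counter(round(abs(a - b), digits) for a, b in combinations(mzs, 2))
    return [(distance, count)
            for distance, count in counts.most_common()
            if count >= min_count]

# Made-up m/z values: three ions and their partners shifted by 21.98 Da
mzs = [100.00, 121.98, 157.03, 179.01, 233.11, 255.09]
recurring = frequent_mass_distances(mzs, min_count=3)
```

Here the distance 21.98 Da occurs three times, which matches the Na/H exchange between [M+Na]+ and [M+H]+ ions. Distances that recur without matching any known list entry are candidates for dataset-specific adducts or neutral losses.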
7.5.1 Adducts list
A list of common adducts is available from the commonMZ project.
Here is Isotope pattern prediction.
BioCAn combines the results from database searches and in silico fragmentation analyses and places these results into a relevant biological context for the sample as captured by a metabolic model (Alden et al. 2017).
mzMatch is a modular, open source and platform independent data processing pipeline for metabolomics LC/MS data written in the Java language (Chokkathukalam et al. 2013; Scheltema et al. 2011b). MetAssign is a probabilistic annotation method using a Bayesian clustering approach, which is part of mzMatch (Daly et al. 2014b).
CliqueMS is a computational tool for annotating in-source metabolite ions from LC-MS untargeted metabolomics data based on a coelution similarity network (Senan et al. 2019).
7.6 MS/MS annotation
See the Workflow section for popular platforms. Here is some stand-alone annotation software:
MetDNA is a metabolic reaction network-based recursive metabolite annotation tool for untargeted metabolomics (Shen et al. 2019).
MS2Analyzer annotates small molecule substructures from accurate tandem mass spectra (Ma et al. 2014).
Bar coding selects the mass-to-charge regions containing the most informative metabolite fragments and designates them as bins. It then translates each metabolite fragmentation pattern into a binary code by assigning 1's to bins containing fragments and 0's to bins without fragments. Such coding annotation can be used for MRM data (Spalding et al. 2016).
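The binary coding step can be sketched as below. The bin boundaries and fragment values are made up for illustration, and the function name `barcode` is hypothetical; the published method selects its bins from the most informative fragment regions.

```python
def barcode(fragment_mzs, bins):
    """Translate a fragmentation pattern into a string of 1's and 0's.

    bins: list of (low, high) m/z regions; a bin gets '1' if any fragment
    falls inside [low, high), otherwise '0'.
    """
    return "".join(
        "1" if any(low <= mz < high for mz in fragment_mzs) else "0"
        for low, high in bins
    )

# Hypothetical bins and fragments
bins = [(50.0, 60.0), (60.0, 70.0), (70.0, 80.0), (80.0, 90.0)]
code = barcode([55.05, 84.04], bins)  # fragments fall in the first and last bins
```

Two spectra can then be compared by checking whether their codes are identical or by counting the positions in which they differ.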
7.7 Knowledge based annotation
The method of R. R. Silva et al. (2014) provides a probability ranking of the candidate compounds assigned to masses, under the prior assumption of a connected sample and with additional prior and spectral information modeled by the user. The source code is available.
MetExpert is an expert system to assist users with limited expertise in informatics to interpret GCMS data for metabolite identification without querying spectral databases (Qiu, Lei, and Sumner 2018).
7.8 MS Database for annotation
NIST: Not free
MINE is an open access database of computationally predicted enzyme promiscuity products for untargeted metabolomics. The annotation would be accurate for general compounds database.
MoNA is a platform that collects other open source databases.
GNPS uses the internal correlations in the data and performs network analysis at the level of peaks, instead of annotated compounds, to annotate the data.
LipidBlast: in silico prediction
GMDB is a multistage tandem mass spectral database using a variety of structurally defined glycans.
HMDB is a freely available electronic database containing detailed information about small molecule metabolites found in the human body.
KEGG is a collection of small molecules, biopolymers, and other chemical substances that are relevant to biological systems.
7.9 Compounds Database
PubChem is an open chemistry database at the National Institutes of Health (NIH).
Chemspider is a free chemical structure database providing fast text and structure search access to over 67 million structures from hundreds of data sources.
ChEBI is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds.
RefMet is a reference list of metabolite names.
CAS is the largest substance database.
T3DB is a unique bioinformatics resource that combines detailed toxin data with comprehensive toxin target information.
FooDB is the world’s largest and most comprehensive resource on food constituents, chemistry and biology.
Phenol explorer is the first comprehensive database on polyphenol content in foods.
Drugbank is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information.
LMDB is a freely available electronic database containing detailed information about small molecule metabolites found in different livestock species.
HPV is the High Production Volume Information System.