
Play with HMDB datasets: Part I

The Human Metabolome Database (HMDB) is widely used for annotation in metabolomics studies. However, I am never sure which version of HMDB is used in packages such as xMSannotator or MAIT. Then I realized I could grab the latest version directly from the HMDB website and make the annotations with those data myself. When you have a hammer, everything looks like a nail.

The latest release is from 2018-04-10, in XML format. Since I had already dealt with RSS XML in the text mining post, I assumed such a file would be no big deal. However, when I decompressed it, I found that the single raw XML file is almost 3 GB, comparable to a whole human genome. A closer look showed that much of the information for each compound consists of references, so some data cleaning was clearly needed.
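To see what actually takes up the space, it is enough to stream the file and print the top-level fields of a single record. Here is a minimal sketch with lxml's iterparse, using the same decompressed file name as in the R code below:

from lxml import etree

# stream the decompressed file and stop after the first metabolite record
ns = '{http://www.hmdb.ca}'
for event, elem in etree.iterparse('hmdb_metabolites.xml', tag=ns + 'metabolite'):
    # list the top-level field names of one compound entry
    for child in elem:
        print(child.tag.replace(ns, ''))
    break

Fields like general_references, spectra and the concentration records show up here, and they are exactly what the cleaning below throws away.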

My MacBook Pro is the entry-level model with 8 GB of memory, and a 3 GB XML file makes it almost impossible for R to process the data in memory. I actually tried R, and the procedure seemed to run forever. Here is the R code:

library(tidyverse)
library(xml2)
library(stringr)
hmdb <- read_xml('hmdb_metabolites.xml') %>%
  as_list()
# drop the reference-heavy and unused fields, and lift the
# taxonomy entries we want to keep up to the top level
getremove <- function(list){
  list$synthesis_reference <- NULL
  list$general_references <- NULL
  list$version <- NULL
  list$creation_date <- NULL
  list$update_date <- NULL
  list$secondary_accessions <- NULL
  list$description <- NULL
  list$synonyms <- NULL
  list$traditional_iupac <- NULL
  list$spectra <- NULL
  list$predicted_properties <- NULL
  list$taxonomy$alternative_parents <- NULL
  list$taxonomy$substituents <- NULL
  list$taxonomy$description<- NULL
  list$normal_concentrations <- NULL
  list$abnormal_concentrations <- NULL
  list$diseases <- NULL
  list$ontology <- NULL
  list$direct_parent <- list$taxonomy$direct_parent
  list$kingdom <- list$taxonomy$kingdom
  list$super_class <- list$taxonomy$super_class
  list$class <- list$taxonomy$class
  list$sub_class <- list$taxonomy$sub_class
  list$molecular_framework <- list$taxonomy$molecular_framework
  list$taxonomy <- NULL
  list$protein_associations <- NULL
  list$biofluid_locations <- NULL
  list$pathways <- NULL
  list$tissue_locations <- NULL
  list$experimental_properties <- NULL
  return(list)
}
# the metabolite records sit one level below the root <hmdb> node
subhmdb <- lapply(hmdb$hmdb, getremove)
# pull each field out as a flat vector
id <- unlist(sapply(subhmdb, "[[", "accession"))
name <- unlist(sapply(subhmdb, "[[", "name"))
class <- as.character(sapply(sapply(subhmdb, "[[", "class"), unlist))
subclass <- as.character(sapply(sapply(subhmdb, "[[", "sub_class"), unlist))
MW <- as.numeric(as.character(sapply(sapply(subhmdb, "[[", "monisotopic_molecular_weight"), unlist)))
table <- cbind.data.frame(id = id, name = name, class = class, subclass = subclass, MW = MW)
write.csv(table,file = 'hmdb.csv')

Then I thought I could use the shell to process the XML. Although I eventually got the table via xmlstarlet, the waiting time was far too long. Note that each field has to be selected with a path relative to the matched metabolite node:

xmlstarlet sel -N hmdb=http://www.hmdb.ca -T -t -m //hmdb:metabolite \
  -v "concat(hmdb:accession,',',hmdb:monisotopic_molecular_weight,',',hmdb:iupac_name,',',hmdb:name,',',hmdb:chemical_formula,',',hmdb:cas_registry_number,',',hmdb:smiles,',',hmdb:taxonomy/hmdb:kingdom,',',hmdb:taxonomy/hmdb:direct_parent,',',hmdb:taxonomy/hmdb:super_class,',',hmdb:taxonomy/hmdb:class,',',hmdb:taxonomy/hmdb:sub_class,',',hmdb:taxonomy/hmdb:molecular_framework)" \
  -n hmdb_metabolites.xml > hmdb.csv

Then I figured it was time for Python. I had not used Python much, and after struggling with some differences between Python 2 and Python 3, I finally came up with the following quick-and-dirty code:

# NB: first delete the xmlns="http://www.hmdb.ca" attribute on the root
# element (second line of the xml file), so the plain tag names below
# match without a namespace prefix
from lxml import etree
import csv

xml = 'hmdb.xml'

context = etree.iterparse(xml, tag='metabolite')

def get_text(elem, path):
    # return the first text node under path, or 'NA' when the field is absent
    result = elem.xpath(path + '/text()')
    return result[0] if result else 'NA'

fieldnames = ['accession', 'monisotopic_molecular_weight', 'iupac_name',
              'name', 'chemical_formula', 'cas_registry_number', 'smiles',
              'kingdom', 'direct_parent', 'super_class', 'class',
              'sub_class', 'molecular_framework']

# text mode with an explicit encoding; Python 3's csv module expects str,
# so there is no need to encode the individual fields by hand
csvfile = open('hmdb.csv', 'w', newline='', encoding='utf-8')
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()

for event, elem in context:
    row = {field: get_text(elem, field) for field in
           ['accession', 'monisotopic_molecular_weight', 'iupac_name',
            'name', 'chemical_formula', 'cas_registry_number', 'smiles']}
    # the classification fields live one level down, under taxonomy
    for field in ['kingdom', 'direct_parent', 'super_class', 'class',
                  'sub_class', 'molecular_framework']:
        row[field] = get_text(elem, 'taxonomy/' + field)
    writer.writerow(row)
    # it's safe to call clear() here because no descendants will be
    # accessed after the row has been written
    elem.clear()
    # also eliminate now-empty references from the root node to elem,
    # so memory use stays flat regardless of the file size
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]

del context
csvfile.close()

After 134.2 s it was done, and the resulting CSV file is around 50.5 MB.
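With the table in hand, annotation itself is a simple tolerance lookup: subtract the adduct mass from each measured m/z and match the neutral mass against the monisotopic_molecular_weight column. Here is a minimal sketch with pandas, where the [M+H]+ adduct mass and the 5 ppm tolerance are illustrative assumptions rather than values from the post:

import pandas as pd

# load the cleaned table from above; pandas reads the 'NA' strings as missing
hmdb = pd.read_csv('hmdb.csv').dropna(subset=['monisotopic_molecular_weight'])

def annotate(mz, adduct_mass=1.00728, ppm=5):
    # neutral mass, assuming an [M+H]+ ion (illustrative assumption)
    mass = mz - adduct_mass
    tol = mass * ppm / 1e6
    hits = hmdb[hmdb['monisotopic_molecular_weight'].between(mass - tol, mass + tol)]
    return hits[['accession', 'name', 'chemical_formula']]

# a feature at the [M+H]+ of glucose should return the hexoses
print(annotate(181.0707))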

I will not post the CSV file itself, because if you really want the results you had better run this code on a future version of HMDB. If you don't know how to use Python, just learn it! If you already know one programming language, you can pick up enough of another high-level language in a few minutes to run a script.

I have to mention that the developers of MS-DIAL regularly release an updated MS/MS spectral database online. I wonder how many people actually spend time on those data. If you can find the source of the raw data, process them from the very beginning and you will see how complex it is to produce a reliable database.

OK, Part I only covers the data cleaning. I might show some data mining on these data in the next posts. The take-home message is that tools do not change the results, but they can change the time it takes to get them. Learn something new and it will help you someday.