Brouard Céline, Shen Huibin, Dührkop Kai, d'Alché-Buc Florence, Böcker Sebastian, Rousu Juho
Department of Computer Science, Aalto University, Espoo, Finland; Helsinki Institute for Information Technology, Espoo, Finland.
Chair for Bioinformatics, Friedrich-Schiller University, Jena, Germany.
Bioinformatics. 2016 Jun 15;32(12):i28-i36. doi: 10.1093/bioinformatics/btw246.
An important problem in metabolomics is the identification of metabolites from tandem mass spectrometry data. Machine learning methods have recently been proposed to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output spaces and can handle structured output spaces such as the space of molecules.
We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the similarities in the input (spectra) space and the similarities in the output (molecule) space using two kernel functions. The method approximates the spectra-molecule mapping in two phases. The first phase is a regression problem from the input space to the feature space associated with the output kernel. The second phase is a preimage problem, which consists of mapping the predicted output feature vectors back to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method reduces the running times of both the training step and the test step by several orders of magnitude compared with preceding methods.
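The two phases can be illustrated with a minimal sketch. It assumes kernel ridge regression for the first phase and, for the second, a preimage search restricted to a finite list of candidate molecules scored by their inner product with the predicted feature vector (which presumes a normalized output kernel). The function names, the Gaussian kernels, the regularization value and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A ** 2).sum(axis=1)[:, None]
               + (B ** 2).sum(axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def iokr_coefficients(K_x_train, k_x_test, lam):
    """Phase 1 (assumed kernel ridge regression): return c such that the
    predicted output feature vector is h(x) = sum_i c[i] * psi(y_i)."""
    n = K_x_train.shape[0]
    return np.linalg.solve(K_x_train + lam * np.eye(n), k_x_test)

def iokr_rank_candidates(c, K_y_cand_train):
    """Phase 2 (preimage over a candidate set): score each candidate y by
    <psi(y), h(x)> = sum_i c[i] * k_y(y, y_i), then sort best first."""
    scores = K_y_cand_train @ c
    return np.argsort(-scores), scores

# Hypothetical toy data: rows of X stand in for spectra features,
# rows of Y for molecule descriptors such as fingerprints.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y_train = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
candidates = Y_train.copy()        # candidate molecules to rank
x_test = np.array([[0.9, 0.1]])    # query close to training spectrum 1

K_x = gaussian_kernel(X_train, X_train)
k_x = gaussian_kernel(X_train, x_test)[:, 0]
K_y_cand = gaussian_kernel(candidates, Y_train)

c = iokr_coefficients(K_x, k_x, lam=1e-6)
order, scores = iokr_rank_candidates(c, K_y_cand)
print("best candidate index:", int(order[0]))
```

In this sketch the linear system depends only on the input kernel, so in practice it can be factorized once at training time and reused for every test spectrum; only the candidate scoring is repeated per query.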
Supplementary data are available at Bioinformatics online.