Department of Computer Science, Tufts University, Medford, MA, 02155, United States.
Department of Chemical and Biological Engineering, Tufts University, Medford, MA, 02155, United States.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae490.
A key challenge in metabolomics is annotating measured spectra from a biological sample with chemical identities. Currently, only a small fraction of measurements can be assigned identities. Two complementary computational approaches have emerged to address the annotation problem: mapping candidate molecules to spectra, and mapping query spectra to molecular candidates. In essence, the candidate molecule whose spectrum best explains the query spectrum is recommended as the target molecule. Although candidate ranking is fundamental to both approaches, few prior works have incorporated rank-learning tasks when determining the target molecule.
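To make the ranking step concrete, here is a minimal sketch of scoring candidates against a query spectrum. It assumes spectra are already binned into equal-length intensity vectors and uses cosine similarity as the score; both choices, and the names `cosine_similarity`, `rank_candidates`, and `target_rank`, are illustrative assumptions rather than details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


def rank_candidates(query_spectrum, predicted_spectra):
    """Return candidate names ordered from best to worst match.

    predicted_spectra maps a candidate name to its predicted (binned)
    spectrum; this layout is an assumption made for illustration.
    """
    scores = {
        name: cosine_similarity(query_spectrum, spectrum)
        for name, spectrum in predicted_spectra.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def target_rank(ranking, target):
    """1-based position of the true molecule in the ranked list."""
    return ranking.index(target) + 1
```

Averaging `target_rank` over many query spectra gives an average-rank metric, while counting how often it is at most k gives a top-k metric.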
We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), for metabolite annotation. ESP builds on prior neural network-based annotation models that use multilayer perceptron (MLP) networks and graph neural networks (GNNs). Based on the ranking results of the MLP- and GNN-based models, ESP learns a weighting for the outputs of the MLP and GNN spectral predictors to generate a spectral prediction for a query molecule. Importantly, training data is stratified by molecular formula to provide candidate sets during model training. Further, the baseline MLP and GNN models are enhanced by considering peak dependencies through label mixing and multi-tasking on spectral topic distributions. When trained on the NIST 2020 dataset and evaluated on the relevant candidate sets from PubChem, ESP improves average rank by 23.7% and 37.2% over the MLP and GNN baselines, respectively, demonstrating a performance gain over state-of-the-art neural network approaches. However, MLP approaches remain strong contenders when considering top five ranks. We also show that annotation performance depends on the training dataset, the number of molecules in the candidate set, and candidate similarity to the target molecule.
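The two ideas of weighting the predictors' outputs and stratifying by molecular formula can be sketched as follows. This is a simplified illustration, not the ESP implementation: it assumes a single scalar weight supplied by the caller (ESP learns its weighting), and the function names and the `(molecule_id, formula)` input layout are hypothetical.

```python
from collections import defaultdict


def ensemble_spectrum(mlp_pred, gnn_pred, weight):
    """Blend two predicted spectra bin by bin.

    weight is the share given to the MLP prediction; the GNN prediction
    receives 1 - weight. A fixed scalar is an assumption for this sketch.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(mlp_pred) != len(gnn_pred):
        raise ValueError("spectra must have the same number of bins")
    return [weight * m + (1.0 - weight) * g for m, g in zip(mlp_pred, gnn_pred)]


def group_by_formula(molecules):
    """Group molecule ids by molecular formula to form candidate sets.

    molecules is an iterable of (molecule_id, formula) pairs.
    """
    groups = defaultdict(list)
    for molecule_id, formula in molecules:
        groups[formula].append(molecule_id)
    return dict(groups)
```

Molecules that share a formula have the same nominal mass, which is why grouping by formula yields candidate sets of the kind an annotation tool must discriminate between.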
The ESP code, a trained model, and a Jupyter notebook that guides users through the ESP tool are available at https://github.com/HassounLab/ESP.