Kalia Apurva, Zhou Chen Yan, Krishnan Dilip, Hassoun Soha
Department of Computer Science, Tufts University, Medford, MA 02155, United States.
Google DeepMind, Mountain View, CA 94043, United States.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf354.
A major challenge in metabolomics is annotation: assigning molecular structures to mass spectral fragmentation patterns. Despite recent advances in molecule-to-spectra and in spectra-to-molecular fingerprint (FP) prediction, annotation rates remain low.
We introduce JESTR, a novel tool for annotation. Unlike prior approaches that "explicitly" construct molecular FPs or spectra, JESTR leverages the insight that molecules and their corresponding spectra are views of the same data, and embeds their representations in a joint space. Candidate structures are ranked by the cosine similarity between the embedding of the query spectrum and the embedding of each candidate. We evaluate JESTR against mol-to-spec, spec-to-FP, and spec-mol matching annotation tools on four datasets. On average, for rank@[1-20], JESTR outperforms other tools by 55.5%-302.6%. We further demonstrate the value of regularization with candidate molecules during training, which boosts rank@1 performance by 5.72% across all datasets and enhances the model's ability to discern between target and candidate molecules. When compared against publicly available pretrained models of SIRIUS and CFM-ID on appropriate subsets of the MassSpecGym dataset, JESTR outperforms these tools by 31% and 238%, respectively. Through JESTR, we offer a novel, promising avenue toward accurate annotation, thereby unlocking valuable insights into the metabolome.
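The ranking step described above can be illustrated with a minimal sketch. It is not the JESTR implementation: it assumes the spectrum and candidate molecules have already been mapped to vectors in the joint space by some trained encoders (not shown), and the function names `cosine_similarity`, `rank_candidates`, and `hit_at_k` are hypothetical, chosen only for this illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query_emb: Sequence[float],
                    candidate_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hit_at_k(ranking: Sequence[int], target_index: int, k: int) -> bool:
    """True if the target molecule appears among the top-k ranked candidates."""
    return target_index in ranking[:k]


# Toy embeddings (made up for illustration, not real model output).
query = [1.0, 0.0]                    # embedding of the query spectrum
candidates = [[0.0, 1.0],             # candidate 0: orthogonal to the query
              [2.0, 0.1],             # candidate 1: nearly parallel to the query
              [1.0, 1.0]]             # candidate 2: 45 degrees from the query
ranking = rank_candidates(query, candidates)
print(ranking)                        # candidate 1 first, then 2, then 0
print(hit_at_k(ranking, target_index=1, k=1))
```

Averaging `hit_at_k` over many query spectra gives a rank@k score of the kind reported above, under the assumption that each query has exactly one target molecule in its candidate set.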
Code and dataset available at https://github.com/HassounLab/JESTR1/.