Chemical representation standardization needed to generalize metabolic pathway involvement prediction across the Kyoto Encyclopedia of Genes and Genomes, Reactome, and MetaCyc knowledgebases.

Author information

Huckvale Erik D, Moseley Hunter N B

Affiliations

Markey Cancer Center, University of Kentucky, Lexington, KY, USA.

Superfund Research Center, University of Kentucky, Lexington, KY, USA.

Publication information

bioRxiv. 2025 Apr 8:2025.04.02.646918. doi: 10.1101/2025.04.02.646918.

Abstract

MOTIVATION

Due to the utility of knowing the pathway involvement of compounds detected in biological experiments, knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and MetaCyc have aggregated pathway annotations of compounds. However, these annotations are largely incomplete and are costly to obtain experimentally and curate from published scientific literature.

RESULTS

We constructed a new dataset using compounds and their pathway annotations from KEGG, Reactome, and MetaCyc. Using this dataset, we trained and tested an extreme classification model that classifies 8,195 unique pathways based on compound chemical representations with a mean Matthews correlation coefficient (MCC) of 0.9036 ± 0.0033. During model evaluation, we discovered an inconsistency in chemical representations across knowledgebases, which was alleviated by standardizing the chemical representations using InChI (IUPAC International Chemical Identifier) canonicalization. Next, we compared the MCC between compounds and their cross-knowledgebase references. With non-standardized chemical representations, the MCC dropped by 0.2687, whereas with standardized chemical representations it dropped by only 0.0384. Thus, standardizing chemical representations is an essential step when predicting on novel chemical representations.
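For readers unfamiliar with the metric, the following is a minimal sketch of the standard binary Matthews correlation coefficient, computed from confusion-matrix counts. It is not the authors' evaluation code: the function name `mcc_binary` and the toy labels are assumptions made for illustration, and the abstract does not state how per-pathway predictions were aggregated into the reported mean.

```python
import math


def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    Hypothetical helper for illustration; uses the standard definition
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: return 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy example: 1 means "compound is annotated to the pathway".
labels = [1, 1, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 1]
print(mcc_binary(labels, predictions))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). Because it uses all four confusion-matrix cells, it stays informative when positive labels are rare, which is typical when each compound belongs to only a few of many pathways.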

AVAILABILITY AND IMPLEMENTATION

All code and data for reproducing the results of this manuscript are available in the following figshare items:

Manuscript main results: https://doi.org/10.6084/m9.figshare.28701845

CV analysis of model and dataset of prior studies: https://doi.org/10.6084/m9.figshare.28701590


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9c4/12026579/b739307d3a10/nihpp-2025.04.02.646918v1-f0001.jpg
