Huckvale Erik D, Moseley Hunter N B
Markey Cancer Center, University of Kentucky, Lexington, KY, USA.
Superfund Research Center, University of Kentucky, Lexington, KY, USA.
bioRxiv. 2025 Apr 8:2025.04.02.646918. doi: 10.1101/2025.04.02.646918.
Because knowing the pathway involvement of compounds detected in biological experiments is useful, knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and MetaCyc have aggregated pathway annotations of compounds. However, these annotations are largely incomplete, and they are costly both to obtain experimentally and to curate from the published scientific literature.
We constructed a new dataset using compounds and their pathway annotations from KEGG, Reactome, and MetaCyc. Using this dataset, we trained and tested an extreme classification model that classifies 8,195 unique pathways based on compound chemical representations with a mean Matthews correlation coefficient (MCC) of 0.9036 ± 0.0033. During model evaluation, we discovered an inconsistency in chemical representations across knowledgebases, which was alleviated by standardizing the chemical representations using InChI (IUPAC International Chemical Identifier) canonicalization. Next, we compared the MCC between compounds and their cross-knowledgebase references. With non-standardized chemical representations, the MCC dropped by 0.2687, whereas with standardized chemical representations it dropped by only 0.0384. Thus, standardizing chemical representations is an essential step when predicting on novel chemical representations.
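For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an MCC can be computed from binary predictions. It assumes each compound-pathway pair is scored as a binary label (1 = compound is annotated to the pathway, 0 = not). The abstract does not state how the reported mean MCC was aggregated, so the function names and the toy labels here are illustrative assumptions, not the authors' evaluation code.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Returns 0.0 when the denominator is zero (a common convention,
    used here as an assumption).
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical toy labels: 8 compound-pathway pairs, 3 of them true annotations.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

score = matthews_corrcoef(y_true, y_pred)
print(f"MCC = {score:.4f}")
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it informative when positive labels are rare, as is typical when most compounds belong to only a few of many pathways.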
All code and data for reproducing the results of this manuscript are available in the following figshare items:
Manuscript main results: https://doi.org/10.6084/m9.figshare.28701845
CV analysis of model and dataset of prior studies: https://doi.org/10.6084/m9.figshare.28701590