Huckvale Erik D, Powell Christian D, Jin Huan, Moseley Hunter N B
Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA.
Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA.
Metabolites. 2023 Nov 1;13(11):1120. doi: 10.3390/metabo13111120.
Metabolic pathways are human-defined groupings of life-sustaining biochemical reactions, with metabolites serving as both the reactants and the products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, which hinders metabolic interpretation. To address this gap, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites from their chemical descriptions. These prior models, however, rely on outdated KEGG-based metabolite datasets, including one benchmark dataset that is invalid because it contains over 1500 duplicate entries. We have therefore developed a new benchmark dataset derived from KEGG, following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. Using this new benchmark dataset together with our atom coloring methodology, we developed and compared the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
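For readers unfamiliar with the two reported metrics, the following minimal sketch shows how an F1 score and a Matthews correlation coefficient (MCC) can be computed from the confusion-matrix counts of a single binary classifier, and how per-category scores might be combined into a weighted average. The function names, the example counts, and the choice of weighting each category by its number of positive examples are illustrative assumptions, not details taken from the study.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_average(scores: list[float], weights: list[float]) -> float:
    """Average of scores, each weighted by the matching entry in weights."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical confusion-matrix counts (tp, tn, fp, fn) for two pathway
# categories; these numbers are made up for illustration only.
counts = {
    "category_a": (8, 88, 2, 2),
    "category_b": (30, 60, 5, 5),
}

f1_values = [f1_score(tp, fp, fn) for tp, tn, fp, fn in counts.values()]
mcc_values = [mcc(tp, tn, fp, fn) for tp, tn, fp, fn in counts.values()]

# Assumption: weight each category by its number of positive examples.
supports = [tp + fn for tp, tn, fp, fn in counts.values()]

print("weighted F1:", weighted_average(f1_values, supports))
print("weighted MCC:", weighted_average(mcc_values, supports))
```

Unlike F1, the MCC also takes true negatives into account, which is one reason the two metrics are often reported together when classes are imbalanced.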