Huckvale Erik D, Moseley Hunter N B
Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA.
Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA.
bioRxiv. 2024 Apr 2:2024.04.01.587582. doi: 10.1101/2024.04.01.587582.
A major limitation of most metabolomics datasets is the sparsity of pathway annotations for detected metabolites. Commonly, fewer than half of the identified metabolites in these datasets have known metabolic pathway involvement. To address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by a metabolic knowledgebase such as the Kyoto Encyclopedia of Genes and Genomes (KEGG). Most of these models are implemented as a single binary classifier specific to a single pathway category, so generating predictions for multiple pathway categories requires a set of binary classifiers. This one-classifier-per-pathway-category approach both multiplies the computational resources needed for training and dilutes the positive entries in the gold standard datasets used for training. To address the limitations of training separate classifiers, we propose a generalization of the metabolic pathway prediction problem: a single binary classifier that accepts both features representing a metabolite and features representing a generic pathway category, and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite-pathway features-pair approach is not only competitive with the combined performance of separately trained binary classifiers, but also outperforms the previous benchmark models.
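The features-pair formulation described in the abstract can be illustrated with a minimal sketch of how one training set might be assembled for a single binary classifier. Everything here is hypothetical: the identifiers, the toy feature values, the `build_pair_dataset` helper, and the choice to combine the two feature vectors by simple concatenation are assumptions made for illustration, not details taken from the paper.

```python
from itertools import product

# Hypothetical toy features; real ones would be derived from a knowledgebase such as KEGG.
metabolite_features = {
    "met_A": [1.0, 0.0, 2.0],
    "met_B": [0.0, 3.0, 1.0],
}
pathway_features = {
    "path_X": [0.5, 0.5],
    "path_Y": [1.0, 0.0],
}

# Hypothetical gold standard: known (metabolite, pathway category) associations.
known_associations = {("met_A", "path_X"), ("met_B", "path_X"), ("met_B", "path_Y")}


def build_pair_dataset(met_feats, path_feats, positives):
    """Build one row per metabolite-pathway pair.

    Each row is the metabolite vector followed by the pathway vector
    (concatenation is an assumption), and the label is 1 when the pair
    is a known association, otherwise 0.
    """
    pairs, X, y = [], [], []
    for met, path in product(sorted(met_feats), sorted(path_feats)):
        pairs.append((met, path))
        X.append(met_feats[met] + path_feats[path])
        y.append(1 if (met, path) in positives else 0)
    return pairs, X, y


pairs, X, y = build_pair_dataset(
    metabolite_features, pathway_features, known_associations
)

for pair, row, label in zip(pairs, X, y):
    print(pair, row, label)
```

Under this framing, every known association from every pathway category contributes a positive row to the same training set, so any single binary classifier fit on `X` and `y` sees all positives at once rather than only those of one category.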