Enveda Biosciences, Boulder, CO, USA.
Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac481.
Recent advances in Knowledge Graphs (KGs) and Knowledge Graph Embedding Models (KGEMs) have led to their adoption in a broad range of fields and applications. The current publishing system in machine learning requires newly introduced KGEMs to achieve state-of-the-art performance on at least one benchmark in order to be published. Despite this, dozens of novel architectures are published every year, making it challenging for users, even within the field, to deduce the most suitable configuration for a given application. A typical biomedical application of KGEMs is drug-disease prediction in the context of drug discovery, in which a KGEM is trained to predict triples linking drugs and diseases. These predictions can later be tested in clinical trials following extensive experimental validation. However, because evaluating every prediction is infeasible and only a small number of candidates can be experimentally tested, models that yield higher precision on the top prioritized triples are preferred. In this paper, we apply the concept of ensemble learning to KGEMs for drug discovery to assess whether combining the predictions of several models can lead to an overall improvement in predictive performance. First, we trained and benchmarked 10 KGEMs to predict drug-disease triples on two independent biomedical KGs designed for drug discovery. Next, we applied different ensemble methods that aggregate the predictions of these models by leveraging either the distribution or the position of the predicted triple scores. We then demonstrate that the ensemble models can achieve better results than the original KGEMs by benchmarking the precision (i.e., number of true positives prioritized) of their top predictions. Lastly, the source code presented in this work is available at https://github.com/enveda/kgem-ensembles-in-drug-discovery.
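To make the two aggregation ideas mentioned above more concrete, the following is a minimal sketch in Python, not the implementation from the linked repository. It assumes that each model assigns one score per candidate drug-disease triple and that a higher score means a more plausible triple. The function names (`zscore_ensemble`, `rank_ensemble`, `precision_at_k`), the toy scores, and the labels are all hypothetical and chosen only for illustration. The first function uses the distribution of each model's scores (per-model z-score normalization followed by averaging), the second uses only the position of each triple in each model's ranking, and the third computes precision on the top k prioritized triples.

```python
import numpy as np


def zscore_ensemble(scores: np.ndarray) -> np.ndarray:
    """Distribution-based aggregation (illustrative assumption).

    scores has shape (n_models, n_triples); higher means more plausible.
    Each model's scores are standardized so that models with different
    score scales contribute comparably, then averaged across models.
    """
    mean = scores.mean(axis=1, keepdims=True)
    std = scores.std(axis=1, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # guard against constant scores
    return ((scores - mean) / std).mean(axis=0)


def rank_ensemble(scores: np.ndarray) -> np.ndarray:
    """Position-based aggregation (illustrative assumption).

    Each triple receives its ascending rank within each model
    (0 = lowest score), and the ranks are averaged across models.
    """
    ranks = scores.argsort(axis=1).argsort(axis=1)
    return ranks.mean(axis=0)


def precision_at_k(agg_scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of true positives among the k highest-scored triples."""
    top_k = np.argsort(-agg_scores, kind="stable")[:k]
    return float(labels[top_k].mean())


# Hypothetical toy data: 3 models scoring 6 candidate triples,
# each model on its own score scale.
scores = np.array([
    [-1.0, -5.0, -2.0, -6.0, -4.0, -3.0],
    [0.9, 0.2, 0.8, 0.1, 0.7, 0.3],
    [12.0, 3.0, 5.0, 1.0, 2.0, 10.0],
])
labels = np.array([1, 0, 1, 0, 0, 1])  # 1 = known true drug-disease pair

for name, aggregate in [("z-score", zscore_ensemble), ("rank", rank_ensemble)]:
    print(name, precision_at_k(aggregate(scores), labels, k=3))
```

Rank averaging discards score magnitudes and is therefore insensitive to outlier scores, whereas z-score averaging keeps the information about how strongly a model separates a triple from the rest. Which behavior is preferable will depend on the models being combined.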