Self-prediction of relations in GO facilitates its quality auditing.

Affiliation

School of Computer Science, University of South China, Hengyang, Hunan, 421001, China.

Publication information

J Biomed Inform. 2023 Aug;144:104441. doi: 10.1016/j.jbi.2023.104441. Epub 2023 Jul 10.

Abstract

As applications of the Gene Ontology (GO) increase rapidly in the biomedical field, auditing its quality is becoming increasingly important. Existing auditing methods are mostly based on rules, observed patterns, or hypotheses. In this study, we propose a machine-learning-based framework for GO to audit itself: we first predict the IS-A relations among concepts in GO, then use the differences between the predicted results and the existing relations to uncover potential errors. Specifically, we transform the taxonomy of the GO January 2020 release into a dataset with concept pairs as items and the relations between them as labels (pairs with no direct IS-A relation are labeled as ndrs). To obtain a full representation of each pair, we integrate the embeddings of the concept name, the concept definition, and the concept node in a substring-based topological graph. We divide the dataset into 10 parts and rotate over them, each time choosing one part as the testing set and the remaining parts as the training set. After 10 rotations, the prediction model had predicted 4,640 existing IS-A pairs as ndrs. In the GO March 2022 release, 340 of these predictions were validated, a significant result (p-value of 1.60e-46) when compared with randomly selected pairs. On the other hand, the model predicted 2,840 out of 17,079 selected ndrs in GO to be IS-A relations. After deleting those that caused redundancies and cycles, 924 predicted IS-A relations remained. Among 200 randomly selected pairs, 30 were validated as missing IS-A relations by domain experts. In conclusion, this study investigates a novel way of auditing biomedical ontologies by predicting the relations in them, which was shown to be useful for discovering potential errors.
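Two steps in the abstract lend themselves to a short illustration: the 10-part rotation and the removal of predicted IS-A relations that cause redundancies or cycles. The Python sketch below is not the authors' implementation. The function names (`k_fold_rotation`, `reachable`, `filter_predicted_is_a`) are hypothetical, and the filtering criteria are assumptions: a predicted pair is treated as causing a cycle if the parent can already reach the child through existing IS-A edges, and as redundant if the child can already reach the parent. Each predicted pair is checked against the existing taxonomy only, not against other predictions.

```python
from collections import defaultdict


def k_fold_rotation(items, k=10):
    """Yield (train, test) splits; each of the k parts serves as the test set once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def reachable(graph, start, goal):
    """Return True if goal can be reached from start via child -> parent edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


def filter_predicted_is_a(existing, predicted):
    """Keep predicted (child, parent) pairs that add no cycle and no redundancy.

    Assumed criteria (not taken from the paper):
      - cycle: the parent already reaches the child through existing edges;
      - redundancy: the child already reaches the parent through existing edges.
    """
    graph = defaultdict(set)
    for child, parent in existing:
        graph[child].add(parent)

    kept = []
    for child, parent in predicted:
        if reachable(graph, parent, child):
            continue  # adding child -> parent would close a cycle
        if reachable(graph, child, parent):
            continue  # relation is already implied by the existing taxonomy
        kept.append((child, parent))
    return kept


# Toy taxonomy with placeholder concept names: c IS-A b, b IS-A a.
existing_edges = [("b", "a"), ("c", "b")]
predicted_edges = [("c", "a"), ("a", "c"), ("d", "b")]
surviving = filter_predicted_is_a(existing_edges, predicted_edges)

# 25 placeholder items split into 10 rotating train/test partitions.
splits = list(k_fold_rotation(list(range(25)), k=10))
```

In the toy example, `("c", "a")` is dropped as already implied, `("a", "c")` is dropped because it would close a cycle, and only `("d", "b")` survives. In the setting the abstract describes, a classifier would be trained on each `train` split and its predictions on the matching `test` split compared against the existing labels.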
