School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
Department of Computer Science.
Bioinformatics. 2020 May 1;36(10):3207-3214. doi: 10.1093/bioinformatics/btaa106.
The Gene Ontology (GO) is the unifying biological vocabulary for codifying, managing and sharing biological knowledge. Quality issues in GO, if not addressed, can cause misleading results or missed biological discoveries. Manual identification of potential quality issues in GO is a challenging and arduous task, given its growing size. We introduce an automated auditing approach for suggesting potentially missing is-a relations, which may further reveal erroneous is-a relations.
We developed a Subsumption-based Sub-term Inference Framework (SSIF) that applies a novel term algebra to a sequence-based representation of GO concepts, together with three conditional rules (monotonicity, intersection and sub-concept rules). Applying SSIF to the October 3, 2018 release of GO suggested 1938 unique potentially missing is-a relations. Domain experts evaluated a random sample of 210 potentially missing is-a relations. The results showed that SSIF achieved a precision of 60.61%, 60.49% and 46.03% for the monotonicity, intersection and sub-concept rules, respectively.
SSIF is implemented in Java. The source code is available at https://github.com/rashmie/SSIF.
Supplementary data are available at Bioinformatics online.