Cohen Barry, Oren Marc, Min Hua, Perl Yehoshua, Halper Michael
Computer Science Department, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA.
J Biomed Inform. 2008 Dec;41(6):904-13. doi: 10.1016/j.jbi.2008.03.010. Epub 2008 Mar 28.
Biomedical research has identified many human genes and accumulated a wide range of knowledge about them. The National Cancer Institute Thesaurus (NCIT) represents such knowledge as concepts and roles (relationships). Because the field advances rapidly, the NCIT's Gene hierarchy can be expected to contain role errors. A comparative methodology is presented for auditing the Gene hierarchy against the National Center for Biotechnology Information's (NCBI's) Entrez Gene database. The two knowledge sources are accessed via a pair of Web crawlers to ensure up-to-date data. Our algorithms then compare the knowledge gathered from each, identify discrepancies that represent probable errors, and suggest corrective actions. The primary focus is on two kinds of gene roles: (1) the chromosomal locations of genes, and (2) the biological processes in which genes play a role. For chromosomal locations, the discrepancies revealed are striking and systematic, suggesting a structurally common origin. For biological processes, difficulties arise because genes frequently play roles in multiple processes, and a process may have many designations (such as synonymous terms). Our algorithms make use of the roles defined in the NCIT Biological Process hierarchy to uncover many probable gene-role errors in the NCIT. These results show that automated comparative auditing is a promising technique that can identify a large number of probable errors, together with corrections for them, in a terminological genomic knowledge repository, thus facilitating its overall maintenance.
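The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the record layout, the normalization rule, and the gene symbols (GENE_A, GENE_B, ...) are all hypothetical, and the data are toy values rather than actual NCIT or Entrez Gene content. The sketch assumes the crawlers have already produced, for each source, a mapping from gene symbol to chromosomal location and to a set of biological process names. For processes it substitutes a flat synonym table for the NCIT Biological Process hierarchy that the abstract mentions, which is a deliberate simplification.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Discrepancy:
    gene: str
    audited_value: Optional[str]  # value found in the terminology being audited
    reference_value: str          # value found in the reference source
    suggestion: str               # proposed corrective action


def normalize_location(loc: str) -> str:
    """Toy canonical form for a location string (assumed rule, not a standard)."""
    return loc.strip().lower().replace(" ", "")


def audit_locations(audited: dict[str, Optional[str]],
                    reference: dict[str, str]) -> list[Discrepancy]:
    """Flag genes whose audited location is missing or differs from the reference."""
    findings = []
    for gene in sorted(audited):
        if gene not in reference:
            continue  # nothing to compare against, so the gene cannot be checked
        aud_loc, ref_loc = audited[gene], reference[gene]
        if aud_loc is None:
            findings.append(Discrepancy(gene, None, ref_loc, f"add location {ref_loc}"))
        elif normalize_location(aud_loc) != normalize_location(ref_loc):
            findings.append(
                Discrepancy(gene, aud_loc, ref_loc, f"replace {aud_loc} with {ref_loc}")
            )
    return findings


def audit_processes(audited: dict[str, set[str]],
                    reference: dict[str, set[str]],
                    synonyms: dict[str, str]) -> dict[str, list[str]]:
    """List reference processes absent from the audited set, after synonym mapping."""
    def canon(term: str) -> str:
        key = term.strip().lower()
        return synonyms.get(key, key)  # map a synonym to its preferred term

    missing = {}
    for gene in sorted(reference):
        if gene not in audited:
            continue
        have = {canon(t) for t in audited[gene]}
        lacking = sorted({canon(t) for t in reference[gene]} - have)
        if lacking:
            missing[gene] = lacking
    return missing


# Toy inputs standing in for crawler output (hypothetical values).
audited_loc = {"GENE_A": "17p13.1", "GENE_B": "1q21", "GENE_C": None, "GENE_D": "Xq28"}
reference_loc = {"GENE_A": "17P13.1 ", "GENE_B": "1p21", "GENE_C": "5q31"}
location_findings = audit_locations(audited_loc, reference_loc)

synonym_table = {"programmed cell death": "apoptosis"}
audited_proc = {"GENE_A": {"Apoptosis"}, "GENE_B": {"cell cycle"}}
reference_proc = {"GENE_A": {"programmed cell death", "DNA repair"}, "GENE_B": {"cell cycle"}}
process_findings = audit_processes(audited_proc, reference_proc, synonym_table)

for finding in location_findings:
    print(finding.gene, "->", finding.suggestion)
print(process_findings)
```

Each finding is a probable error with a suggested correction, not a confirmed one; in a workflow like the one the abstract describes, such output would still go to a human curator for review.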