Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8, Canada.
School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia.
BMC Bioinformatics. 2019 Apr 29;20(1):216. doi: 10.1186/s12859-019-2801-x.
Large biological databases such as GenBank contain vast numbers of records, whose content is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. In this paper we explore an automated method for assessing the consistency of biological assertions, intended to assist biocurators, which we call BARC (Biocuration tool for Assessment of Relation Consistency). In this method, a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm, SaBRA, to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct.
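The three steps named above (represent the assertion as a relation, retrieve pertinent literature, estimate how likely the relation is to be correct) can be illustrated with a minimal sketch. It is not BARC itself: the abstract does not describe SaBRA or the classifier in detail, so the retrieval here is a plain co-occurrence filter and the score is a toy ratio standing in for a trained classifier. All names (`Assertion`, `retrieve`, `support_score`) and the example sentences are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A biological assertion as a relation between two objects."""
    subject: str  # e.g. a gene symbol
    obj: str      # e.g. a disease name


def retrieve(assertion, corpus):
    """Naive stand-in for retrieval (NOT SaBRA): keep documents
    whose text mentions both objects of the relation."""
    a, b = assertion.subject.lower(), assertion.obj.lower()
    return [doc for doc in corpus if a in doc.lower() and b in doc.lower()]


def support_score(assertion, corpus):
    """Toy stand-in for the classifier: the fraction of documents
    mentioning the first object that also mention the second."""
    a = assertion.subject.lower()
    mentioning_subject = [doc for doc in corpus if a in doc.lower()]
    if not mentioning_subject:
        return 0.0
    return len(retrieve(assertion, corpus)) / len(mentioning_subject)


# Made-up example sentences, for illustration only.
corpus = [
    "BRCA1 mutations are linked to breast cancer.",
    "BRCA1 is involved in DNA repair.",
    "Diabetes is a metabolic disease.",
]
claim = Assertion("BRCA1", "breast cancer")
print(retrieve(claim, corpus))
print(support_score(claim, corpus))
```

A real system would replace substring matching with entity recognition and the ratio with a classifier trained on features of the retrieved documents; the sketch only shows how the three steps fit together.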
Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators in data cleansing. Specifically, BARC substantially outperforms the best baselines, improving F-measure by 3.5% and 13% on gene-disease relations and protein-protein interactions, respectively. A feature analysis further showed that all feature types are informative, as are all fields of the documents.
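For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below uses the standard definition. The abstract does not state which weighting was used or whether the reported improvements are absolute or relative, so the balanced default (`beta=1.0`) and the counts in the example are assumptions made for illustration.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Standard F-measure from true positives, false positives and
    false negatives; beta=1.0 gives the balanced F1 score."""
    if tp == 0:
        return 0.0  # no correct predictions, so precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 assertions correctly flagged, 2 wrongly flagged, 2 missed.
print(f_measure(tp=8, fp=2, fn=2))
```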
BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.