Automated detection of records in biological sequence databases that are inconsistent with the literature.

Authors

Bouadjenek Mohamed Reda, Verspoor Karin, Zobel Justin

Affiliation

Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.

Publication

J Biomed Inform. 2017 Jul;71:229-240. doi: 10.1016/j.jbi.2017.06.015. Epub 2017 Jun 15.

DOI:10.1016/j.jbi.2017.06.015

PMID:28624643

Abstract

We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatically detecting data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). We then use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators in performing data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. Through automated comparison with the literature, such records can be identified with a precision of up to 10% and a recall of up to 30%, strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited availability of explicitly labelled data. Overall, the results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators identify inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
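To make the idea of a relevance-based consistency check concrete, the following is a minimal sketch and not the authors' method. It assumes a single quality indicator (TF-IDF cosine similarity between a record's text and its cited article) and a fixed, hypothetical threshold in place of the trained anomaly detection algorithm described in the abstract. The function names, the threshold value, and the example texts are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF keeps every weight positive, even for ubiquitous terms.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(pairs):
    """Score each (record_text, article_text) pair by textual similarity."""
    docs = [text for pair in pairs for text in pair]
    vecs = tfidf_vectors(docs)
    return [cosine(vecs[2 * i], vecs[2 * i + 1]) for i in range(len(pairs))]


def flag_suspicious(pairs, threshold=0.2):
    """Label pairs below the (hypothetical) threshold as 'suspicious'."""
    return [
        "suspicious" if score < threshold else "confident"
        for score in consistency_scores(pairs)
    ]


# Made-up example: the second record shares no vocabulary with its article.
article = "We sequenced the BRCA1 gene in Homo sapiens"
pairs = [
    ("Homo sapiens BRCA1 gene complete cds", article),
    ("Zea mays chloroplast genome", article),
]
labels = flag_suspicious(pairs)
```

In this toy setting the first record is labelled "confident" and the second "suspicious". A realistic system would combine several indicators and learn the decision boundary from labelled records, as the abstract describes, rather than rely on one similarity score and a hand-picked cutoff.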
