School of Computing Science, Newcastle University, Newcastle upon Tyne, United Kingdom.
PLoS One. 2013 Oct 15;8(10):e75541. doi: 10.1371/journal.pone.0075541. eCollection 2013.
A constant influx of new data makes it challenging to keep the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which is often the richest source of knowledge. Many databases reuse existing knowledge; during the curation process, annotations are often propagated between entries. However, this propagation is often not made explicit, so it can be hard, potentially impossible, for a reader to identify where an annotation originated. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB) at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is highly prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over [Formula: see text] sentences. Over [Formula: see text] sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately [Formula: see text] are erroneous, whilst [Formula: see text] appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors' website at http://homepages.cs.ncl.ac.uk/m.j.bell1/sentence_analysis/.
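The idea of tracking a sentence's provenance and propagation can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout (release number, entry identifier, annotation text), the naive sentence splitter, the entry identifiers `P1` and `P2`, and the function names are all assumptions made for illustration. For each sentence, the sketch records the earliest release and entries in which it appears, the entries it later spreads to, and whether it survives in the latest release after disappearing from its original entries.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def track_sentences(records):
    """records: list of (release, entry_id, annotation_text) tuples (hypothetical layout)."""
    records = list(records)
    latest = max(release for release, _, _ in records)

    # sentence -> release -> set of entries containing that sentence
    seen = defaultdict(lambda: defaultdict(set))
    for release, entry_id, text in records:
        for sentence in split_sentences(text):
            seen[sentence][release].add(entry_id)

    report = {}
    for sentence, by_release in seen.items():
        first = min(by_release)
        origin = by_release[first]  # entries holding the sentence when it first appears
        all_entries = set().union(*by_release.values())
        in_latest = by_release[latest] if latest in by_release else set()
        report[sentence] = {
            "first_release": first,
            "origin_entries": sorted(origin),
            "propagated_to": sorted(all_entries - origin),
            # True if the sentence is still in the latest release,
            # but in none of the entries where it originally occurred.
            "outlived_origin": bool(in_latest) and not (origin & in_latest),
        }
    return report


# Toy data with hypothetical entry identifiers and annotation text.
records = [
    (1, "P1", "Binds DNA. Acts as a kinase."),
    (2, "P1", "Binds DNA. Acts as a kinase."),
    (2, "P2", "Acts as a kinase."),
    (3, "P1", "Binds DNA."),
    (3, "P2", "Acts as a kinase."),
]
report = track_sentences(records)
```

In this toy data, "Acts as a kinase." first appears in `P1`, propagates to `P2`, and is later removed from `P1` while remaining in `P2`, which is the kind of case the abstract describes as a sentence remaining in the database after removal from its original entry. Exact string matching is the simplest possible choice here; a real analysis would need a more careful notion of sentence identity.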