Can inferred provenance and its visualisation be used to detect erroneous annotation? A case study using UniProtKB.

Affiliation

School of Computing Science, Newcastle University, Newcastle upon Tyne, United Kingdom.

Publication information

PLoS One. 2013 Oct 15;8(10):e75541. doi: 10.1371/journal.pone.0075541. eCollection 2013.

Abstract

A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which is often the richest source of knowledge. Many databases reuse existing knowledge; during the curation process, annotations are often propagated between entries. However, this propagation is often not made explicit, so it can be hard, potentially impossible, for a reader to identify where an annotation originated. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB) at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is highly prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over [Formula: see text] sentences. Over [Formula: see text] sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately [Formula: see text] are erroneous, whilst [Formula: see text] appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors' website at http://homepages.cs.ncl.ac.uk/m.j.bell1/sentence_analysis/.
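
The idea of inferring sentence provenance from reuse can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input layout (a chronological list of releases, each mapping entry identifiers to annotation sentences), the function names, and the exact-match normalisation are all assumptions made for illustration. The sketch records in which entries each sentence appears per release, treats the earliest appearance as the inferred origin, and flags sentences that are still present in the latest release but no longer in any of their originating entries.

```python
from collections import defaultdict


def normalise(sentence):
    # Assumption: sentences match if equal after lowercasing and
    # collapsing whitespace. The paper's matching rule may differ.
    return " ".join(sentence.lower().split())


def sentence_history(releases):
    """Build a per-sentence timeline of the entries that contain it.

    releases: chronological list of (release_label, {entry_id: [sentence, ...]}).
    Returns {normalised_sentence: [(release_label, {entry_id, ...}), ...]},
    listing only releases in which the sentence occurs.
    """
    history = defaultdict(list)
    for label, entries in releases:
        holders = defaultdict(set)
        for entry_id, sentences in entries.items():
            for sentence in sentences:
                holders[normalise(sentence)].add(entry_id)
        for sentence, entry_ids in holders.items():
            history[sentence].append((label, entry_ids))
    return dict(history)


def orphaned_sentences(releases):
    """Sentences present in the latest release but in none of the
    entries where they first appeared (their inferred origin)."""
    history = sentence_history(releases)
    last_label = releases[-1][0]
    orphans = []
    for sentence, timeline in history.items():
        _, origin_entries = timeline[0]
        final_label, final_entries = timeline[-1]
        if final_label == last_label and not (origin_entries & final_entries):
            orphans.append(sentence)
    return sorted(orphans)
```

With hypothetical input in which a sentence first appears in entry `P1`, is later copied to `P2`, and is then dropped from `P1`, `orphaned_sentences` would return that sentence, which is the kind of case the abstract singles out for closer inspection.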
