An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.

Authors

Camon Evelyn B, Barrell Daniel G, Dimmer Emily C, Lee Vivian, Magrane Michele, Maslen John, Binns David, Apweiler Rolf

Affiliation

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Publication

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S17. doi: 10.1186/1471-2105-6-S1-S17. Epub 2005 May 24.

DOI: 10.1186/1471-2105-6-S1-S17
PMID: 15960829
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1869009/
Abstract

BACKGROUND

The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and the number of proteins requiring characterization increase, the manual processing capability can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test whether automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.

RESULTS

The GOA database currently extracts GO annotation from the literature with 91 to 100% precision, and at least 72% recall. This sets a particularly high threshold for text mining systems, which in the initial results of BioCreAtIvE task 2 (GO annotation extraction and retrieval) precisely predicted GO terms only 10 to 20% of the time.
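
The precision and recall figures above can be illustrated with a minimal sketch. It assumes annotations are represented as (protein accession, GO term) pairs and scored by exact match against a reference set; this is a simplification, since the evaluation described in the abstract was carried out manually by expert curators. The function name, the accessions, and the annotation pairs below are all hypothetical and are not data from the paper.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of (protein, GO term) pairs.

    Assumption: a prediction counts as correct only if the exact pair
    appears in the reference set. Empty inputs yield 0.0 rather than an error.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference (manually curated) annotations.
reference = {
    ("P12345", "GO:0005515"),
    ("P12345", "GO:0005634"),
    ("Q67890", "GO:0016301"),
    ("Q67890", "GO:0006468"),
}

# Hypothetical predictions from a text mining system.
predicted = {
    ("P12345", "GO:0005515"),
    ("Q67890", "GO:0016301"),
    ("Q67890", "GO:0005737"),
}

p, r = precision_recall(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.67 recall=0.50"
```

Here two of the three predictions match the reference (precision 2/3) and two of the four reference annotations are recovered (recall 2/4). A real evaluation would also have to decide how to score a predicted term that is a parent or child of the curated term in the ontology, which exact matching ignores.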

CONCLUSION

Improvements in the performance and accuracy of text mining for GO terms should be expected in the next BioCreAtIvE challenge. In the meantime, the manual and electronic GO annotation strategies already employed by GOA will provide high-quality annotations.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e7c3/1869009/5a4309b97389/1471-2105-6-S1-S17-1.jpg
