Manual curation is not sufficient for annotation of genomic databases.

Authors

Baumgartner William A, Cohen K Bretonnel, Fox Lynne M, Acquaah-Mensah George, Hunter Lawrence

Affiliation

Center for Computational Pharmacology, University of Colorado School of Medicine, USA.

Publication details

Bioinformatics. 2007 Jul 1;23(13):i41-8. doi: 10.1093/bioinformatics/btm229.

DOI:10.1093/bioinformatics/btm229

PMID:17646325

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2516305/

Abstract

MOTIVATION

Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little or no history of work on the subject of evaluation of knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes the application of a metric from software engineering known as the found/fixed graph to the problem of evaluating the processes by which genomic knowledge bases are built, as well as the completeness of their contents.

RESULTS

Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.
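As background, a found/fixed graph in software testing typically plots two cumulative curves over time, defects found and defects fixed, and the gap between them is the open backlog. The sketch below is a minimal illustration, assuming that for a knowledge base "found" can be read as items identified as needing annotation and "fixed" as items that have been annotated. That mapping, the function names, and all numbers are hypothetical and are not taken from the article.

```python
import math
from itertools import accumulate


def found_fixed_series(found_per_period, fixed_per_period):
    """Return cumulative found, cumulative fixed, and open backlog per period.

    Both inputs are per-period counts (hypothetical data, not from the article).
    """
    if len(found_per_period) != len(fixed_per_period):
        raise ValueError("both series must cover the same periods")
    cum_found = list(accumulate(found_per_period))
    cum_fixed = list(accumulate(fixed_per_period))
    # The backlog is the vertical gap between the two cumulative curves.
    open_items = [f - x for f, x in zip(cum_found, cum_fixed)]
    return cum_found, cum_fixed, open_items


def periods_to_close(open_now, found_rate, fixed_rate):
    """Naive linear extrapolation of when the backlog reaches zero.

    Returns None when items are found at least as fast as they are fixed,
    i.e. the curves never converge at the current rates.
    """
    if open_now <= 0:
        return 0
    net = fixed_rate - found_rate
    if net <= 0:
        return None
    return math.ceil(open_now / net)


# Hypothetical per-period counts, chosen only to show a widening gap.
found = [100, 120, 90, 110]
fixed = [40, 50, 45, 55]
cum_found, cum_fixed, backlog = found_fixed_series(found, fixed)
print(cum_found)  # [100, 220, 310, 420]
print(cum_fixed)  # [40, 90, 135, 190]
print(backlog)    # [60, 130, 175, 230]
print(periods_to_close(backlog[-1], found_rate=110, fixed_rate=55))  # None
```

A widening gap, as in this toy data, corresponds to the situation the abstract describes, in which the process will not finish at its current rate. The linear extrapolation is a deliberate simplification.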

Similar articles

Manual curation is not sufficient for annotation of genomic databases.

Bioinformatics. 2007 Jul 1;23(13):i41-8. doi: 10.1093/bioinformatics/btm229.

Automatic annotation of protein function.

Curr Opin Struct Biol. 2005 Jun;15(3):267-74. doi: 10.1016/j.sbi.2005.05.010.

GeConT: gene context analysis.

Bioinformatics. 2004 Sep 22;20(14):2307-8. doi: 10.1093/bioinformatics/bth216. Epub 2004 Apr 8.

Filtering erroneous protein annotation.

Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7. doi: 10.1093/bioinformatics/bth938.

DIG--a system for gene annotation and functional discovery.

Bioinformatics. 2005 Jul 1;21(13):2957-9. doi: 10.1093/bioinformatics/bti467. Epub 2005 May 3.

CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences.

BMC Bioinformatics. 2007 Apr 19;8:129. doi: 10.1186/1471-2105-8-129.

A statistical framework for genomic data fusion.

Bioinformatics. 2004 Nov 1;20(16):2626-35. doi: 10.1093/bioinformatics/bth294. Epub 2004 May 6.

Novel leverage of structural genomics.

Nat Biotechnol. 2007 Aug;25(8):849-51. doi: 10.1038/nbt0807-849.

POLYVIEW: a flexible visualization tool for structural and functional annotations of proteins.

Bioinformatics. 2004 Oct 12;20(15):2460-2. doi: 10.1093/bioinformatics/bth248. Epub 2004 Apr 8.

MineBlast: a literature presentation service supporting protein annotation by data mining of BLAST results.

Bioinformatics. 2005 Aug 15;21(16):3450-1. doi: 10.1093/bioinformatics/bti528. Epub 2005 Jun 7.

Cited by

A large language model framework for literature-based disease-gene association prediction.

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf070.

Harnessing PubMed User Query Logs for Post Hoc Explanations of Recommended Similar Articles.

ArXiv. 2024 Feb 5:arXiv:2402.03484v1.

PubMed and beyond: biomedical literature search in the age of artificial intelligence.

EBioMedicine. 2024 Feb;100:104988. doi: 10.1016/j.ebiom.2024.104988. Epub 2024 Feb 1.

Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature.

Syst Rev. 2023 Jun 5;12(1):94. doi: 10.1186/s13643-023-02247-9.

BLAB2CancerKD: a knowledge graph database focusing on the association between lactic acid bacteria and cancer, but beyond.

Database (Oxford). 2023 May 23;2023. doi: 10.1093/database/baad036.

Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts.

BioData Min. 2022 Oct 18;15(1):26. doi: 10.1186/s13040-022-00311-z.

New reasons for biologists to write with a formal language.

Database (Oxford). 2022 Jun 3;2022. doi: 10.1093/database/baac039.

AnthraxKP: a knowledge graph-based, Anthrax Knowledge Portal mined from biomedical literature.

Database (Oxford). 2022 Jun 2;2022. doi: 10.1093/database/baac037.

SKIOME Project: a curated collection of skin microbiome datasets enriched with study-related metadata.

Database (Oxford). 2022 May 16;2022. doi: 10.1093/database/baac033.

Text mining of gene-phenotype associations reveals new phenotypic profiles of autism-associated genes.

Sci Rep. 2021 Jul 27;11(1):15269. doi: 10.1038/s41598-021-94742-z.

References

GeneRIF quality assurance as summary revision.

Pac Symp Biocomput. 2007:269-80. doi: 10.1142/9789812772435_0026.

Key biology databases go wiki.

Nature. 2007 Feb 15;445(7129):691. doi: 10.1038/445691a.

Genome re-annotation: a wiki solution?

Genome Biol. 2007;8(1):102. doi: 10.1186/gb-2007-8-1-102.

Publishing perishing? Towards tomorrow's information architecture.

BMC Bioinformatics. 2007 Jan 19;8:17. doi: 10.1186/1471-2105-8-17.

The database revolution.

Nature. 2007 Jan 18;445(7125):229-30. doi: 10.1038/445229b.

xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features.

Genome Biol. 2006;7(11):R111. doi: 10.1186/gb-2006-7-11-r111.

GO PaD: the Gene Ontology Partition Database.

Nucleic Acids Res. 2007 Jan;35(Database issue):D322-7. doi: 10.1093/nar/gkl799. Epub 2006 Nov 10.

Finding GeneRIFs via gene ontology annotations.

Pac Symp Biocomput. 2006:52-63.

A biocurator perspective: annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank.

PLoS Comput Biol. 2006 Oct 27;2(10):e99. doi: 10.1371/journal.pcbi.0020099.

yrGATE: a web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes.

Genome Biol. 2006;7(7):R58. doi: 10.1186/gb-2006-7-7-r58.
