Genome annotation assessment in Drosophila melanogaster.

Authors

Reese M G, Hartzell G, Harris N L, Ohler U, Abril J F, Lewis S E

Affiliation

Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology, University of California, Berkeley 94720-3200, USA.

Publication

Genome Res. 2000 Apr;10(4):483-501. doi: 10.1101/gr.10.4.483.

DOI:10.1101/gr.10.4.483

PMID:10779488

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC310877/

Abstract

Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. 
Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
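
The nucleotide-level figures quoted in the abstract (for example, the share of coding nucleotides correctly identified) can be illustrated with a minimal sketch. The abstract does not state the exact metric definitions used in GASP, so the code below assumes a common convention: compare the set of coding positions in a standard annotation with the set in a prediction. The function names, the 1-based inclusive intervals, and the toy coordinates are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumption: an exon is a 1-based, inclusive (start, end) interval on one strand.
Interval = Tuple[int, int]


def covered_positions(exons: Iterable[Interval]) -> Set[int]:
    """Expand exon intervals into the set of coding nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(
    standard: Iterable[Interval], predicted: Iterable[Interval]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) at the nucleotide level.

    sensitivity = correctly predicted coding positions / standard coding positions
    precision   = correctly predicted coding positions / predicted coding positions
    Empty inputs yield 0.0 rather than raising a division error.
    """
    truth = covered_positions(standard)
    called = covered_positions(predicted)
    true_pos = len(truth & called)
    sensitivity = true_pos / len(truth) if truth else 0.0
    precision = true_pos / len(called) if called else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Toy coordinates, not taken from the Adh region.
    standard = [(100, 199), (300, 399)]
    predicted = [(100, 199), (320, 419)]
    sens, prec = nucleotide_accuracy(standard, predicted)
    print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Expanding intervals into position sets keeps the sketch short; for a multi-megabase contig, interval arithmetic would be more memory-efficient. Because the standard sets only approximate the true features of the region, a low precision against them may partly reflect genes missing from the standard rather than genuine false positives.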

