Analysis of EST-driven gene annotation in human genomic sequence.

Authors

Bailey L C, Searls D B, Overton G C

Affiliation

Computational Biology and Informatics Laboratory, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA.

Publication

Genome Res. 1998 Apr;8(4):362-76. doi: 10.1101/gr.8.4.362.

Abstract

We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.
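The two headline measures in the abstract, the fraction of annotated genes detected and the fraction of aligned ESTs that overlap no annotated gene (the "apparent false-positive rate"), can be illustrated with a small interval-overlap calculation. The sketch below is not the authors' pipeline and does not run BLAST: the coordinates are toy values, and the names `overlaps` and `summarize` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# (start, end) in inclusive coordinates on a single genomic sequence.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def summarize(genes: List[Interval], est_hits: List[Interval]) -> Tuple[float, float]:
    """Return (fraction of genes hit by an EST, fraction of ESTs off any gene)."""
    detected = sum(1 for g in genes if any(overlaps(g, h) for h in est_hits))
    off_gene = sum(1 for h in est_hits if not any(overlaps(h, g) for g in genes))
    gene_detection = detected / len(genes) if genes else 0.0
    apparent_fp = off_gene / len(est_hits) if est_hits else 0.0
    return gene_detection, apparent_fp


# Toy data (assumed, not from the paper): three annotated genes, four EST alignments.
genes = [(100, 900), (2000, 3500), (5000, 5600)]
est_hits = [(150, 420), (850, 1100), (2100, 2300), (4000, 4200)]

rates = summarize(genes, est_hits)
print(f"genes detected: {rates[0]:.2f}, apparent false-positive rate: {rates[1]:.2f}")
```

As the abstract notes, an EST that falls outside every annotated gene is only an apparent false positive: it may mark a gene missing from the database annotation, which is why the rate is lower on the exhaustively characterized benchmark sequences.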
