Recognizing exons in genomic sequence using GRAIL II.

Authors

Xu Y, Mural R, Shah M, Uberbacher E

Affiliation

Engineering Physics and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364.

Publication

Genet Eng (N Y). 1994;16:241-53.

PMID: 7765200

Abstract

We describe an improved neural network system for recognizing protein coding regions (exons) in human genomic DNA sequences. This coding region recognition system is part of a new version of GRAIL, GRAIL II, and represents a significant improvement over the coding recognition performance of the previous GRAIL system. GRAIL II divides the process of locating exons into four steps. It first generates an exon candidate pool consisting of all possible (translation start-donor), (acceptor-donor), and (acceptor-translation stop) pairs within all open reading frames of the test sequence. The vast majority of these exon candidates are then eliminated from consideration by applying a set of heuristic rules. After reducing the size of the candidate pool, GRAIL II uses three trained neural networks to evaluate the coding potential and the accuracy of the edges of starting exon, internal exon, and terminal exon candidates. These networks output a set of overlapping candidates for each exon, which differ in their scores and in the positions of their edges. Multiple candidates for a given exon are grouped into a cluster based on their locations relative to candidates corresponding to other exons, and the highest-scoring candidate in each cluster is used as the "best" prediction of the corresponding exon. Unlike the previous GRAIL version, GRAIL II uses variable-length windows to evaluate exon candidates, and its performance is nearly independent of exon length. In addition to several strong indicators of coding potential, the system uses several other types of information, including scores for splice junctions, GC composition, and the properties of the regions adjacent to an exon candidate, to aid in the discrimination process. On a large set of sequences from GenBank (3), GRAIL II located 93% of all exons, regardless of size, with a false positive rate of 12%. Among the true positives, 62% match the actual exons exactly (the exon edges are correct to the base), and 93% match at least one edge correctly. These statistics, especially the false positive rate and the accuracy of the edges, are further improved through a process of gene model construction by the Gene Assembly Program (GAP III) (4) module of GRAIL II, which takes the scored exon candidates as input and constructs optimal gene models. The gene modeling system will be described elsewhere.
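
The candidate-pool and cluster-selection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the GRAIL II implementation, and everything in it is an assumption made for illustration. Only internal (acceptor-donor) candidates are enumerated, and splice sites are taken to be bare AG/GT dinucleotides. The open-reading-frame test is a simple stop-codon scan. `toy_score` is a hypothetical GC-fraction stand-in for the trained neural networks. Clusters are formed by plain overlap rather than by the location-based grouping the abstract describes. All function and type names are hypothetical.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


class Candidate(NamedTuple):
    start: int    # index of the first exon base (0-based, inclusive)
    end: int      # index one past the last exon base (exclusive)
    score: float  # placeholder score, NOT a GRAIL II network output


def has_open_frame(seq: str, start: int, end: int) -> bool:
    """Return True if at least one reading frame of seq[start:end] has no stop codon."""
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(start + frame, end - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def toy_score(seq: str, start: int, end: int) -> float:
    """Hypothetical stand-in for the networks: GC fraction of the candidate."""
    return sum(base in "GC" for base in seq[start:end]) / (end - start)


def internal_candidates(seq: str, min_len: int = 6) -> List[Candidate]:
    """Enumerate (acceptor-donor) pairs: the exon follows an AG and precedes a GT."""
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    pool = []
    for a in acceptors:
        for d in donors:
            # Assumed heuristic filter: minimum length plus an open frame.
            if d - a >= min_len and has_open_frame(seq, a, d):
                pool.append(Candidate(a, d, toy_score(seq, a, d)))
    return pool


def best_per_cluster(cands: List[Candidate]) -> List[Candidate]:
    """Group overlapping candidates and keep the highest-scoring one per group."""
    best: List[Candidate] = []
    cluster_end = -1
    for cand in sorted(cands):
        if best and cand.start < cluster_end:
            # Candidate overlaps the current cluster: keep the better score.
            if cand.score > best[-1].score:
                best[-1] = cand
            cluster_end = max(cluster_end, cand.end)
        else:
            # No overlap: this candidate opens a new cluster.
            best.append(cand)
            cluster_end = cand.end
    return best


if __name__ == "__main__":
    pool = internal_candidates("CCAGGCCGCCGCCGTCC")
    print(best_per_cluster(pool))
```

In the system described above, the score would come from one of three trained networks (starting, internal, or terminal exon), and the pool would also contain (translation start-donor) and (acceptor-translation stop) pairs. The sketch only shows how enumeration, filtering, and per-cluster selection fit together.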
