Snyder E E, Stormo G D
Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder 80309-0347.
Nucleic Acids Res. 1993 Feb 11;21(3):607-13. doi: 10.1093/nar/21.3.607.
Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping, and finds the highest-scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feed-forward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bp of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding, while only 13% of non-exon bp are predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as on the training set.
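The DP step described in the abstract can be illustrated with a minimal sketch. It is not GeneParser's actual implementation: the function `parse_segments`, the per-base `content` scores and the per-position `boundary` scores are hypothetical stand-ins for the weighted content measures and splice-site scores, and the sketch assumes the first and last segment may be of either type. It only shows how a DP recurrence can enforce that exons and introns alternate, are adjacent and do not overlap, while maximizing the total score.

```python
from itertools import accumulate

EXON, INTRON = 0, 1

def parse_segments(content, boundary):
    """Return (best_score, segments) for an alternating exon/intron parse.

    content[k]  : hypothetical per-base score; positive favours exon,
                  negative favours intron.
    boundary[k] : hypothetical score for placing a segment boundary
                  immediately before position k (e.g. a splice-site score).
    segments    : list of (label, start, end) with half-open intervals.
    """
    n = len(content)
    prefix = [0.0] + list(accumulate(content))

    def seg_score(kind, i, j):
        # Exons are rewarded by the summed content score, introns by its negative.
        total = prefix[j] - prefix[i]
        return total if kind == EXON else -total

    neg_inf = float("-inf")
    # best[j][kind]: best score of a parse of [0, j) whose last segment is `kind`.
    best = [[neg_inf, neg_inf] for _ in range(n + 1)]
    back = [[None, None] for _ in range(n + 1)]
    best[0] = [0.0, 0.0]  # assumption: the first segment may be of either type

    for j in range(1, n + 1):
        for kind in (EXON, INTRON):
            other = 1 - kind
            for i in range(j):
                # Segment [i, j) of type `kind` must follow a segment of the
                # other type ending exactly at i (adjacent, non-overlapping).
                site = boundary[i] if i > 0 else 0.0
                cand = best[i][other] + site + seg_score(kind, i, j)
                if cand > best[j][kind]:
                    best[j][kind] = cand
                    back[j][kind] = i

    # Trace back from the better of the two possible final segment types.
    kind = EXON if best[n][EXON] >= best[n][INTRON] else INTRON
    score = best[n][kind]
    segments = []
    j = n
    while j > 0:
        i = back[j][kind]
        segments.append(("exon" if kind == EXON else "intron", i, j))
        j, kind = i, 1 - kind
    segments.reverse()
    return score, segments

# Toy input: two low-scoring flanks around a high-scoring core.
toy_content = [-1, -1, 2, 2, 2, -1, -1]
toy_boundary = [0, 0, 0.5, 0, 0, 0.5, 0]
toy_score, toy_segments = parse_segments(toy_content, toy_boundary)
```

This sketch runs in O(n^2) time for a sequence of length n. In the system described above, the segment scores would instead come from the content measures and splice-site scores, combined using the weights learned by the neural network.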