Solovyev V V, Salamov A A, Lawrence C B
Department of Cell Biology, Baylor College of Medicine, Houston, TX 77030, USA.
Proc Int Conf Intell Syst Mol Biol. 1994;2:354-62.
Discriminant analysis is applied to the problem of recognizing 5'-, internal, and 3'-exons in human DNA sequences. Specific recognition functions were developed to reveal exons of each type. The method is based on a splice site prediction algorithm that uses the linear Fisher discriminant to combine information about significant triplet frequencies in various functional parts of splice site regions with oligonucleotide preferences in protein-coding and intron regions (Solovyev, Lawrence, 1994). The accuracy of our splice site recognition function is about 97%. The discriminant function for 5'-exon prediction includes the hexanucleotide composition of the upstream region, the triplet composition around the ATG codon, ORF coding potential, donor splice site potential, and the composition of the downstream intron region. For internal exon prediction, we combine in one discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site, and 3'-intron region for each open reading frame flanked by GT and AG dinucleotides. On a test set of 451 exon and 246,693 pseudoexon sequences, the accuracy of precise internal exon recognition is 77%, with a specificity of 79% and a level of pseudoexon ORF prediction of 99.96%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. The discriminant function for 3'-exon prediction includes the octanucleotide composition of the upstream intron region, the triplet composition around the stop codon, ORF coding potential, acceptor splice site potential, and the hexanucleotide composition of the downstream region. (ABSTRACT TRUNCATED AT 250 WORDS)
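As an illustration of how a linear Fisher discriminant combines several characteristics into a single recognition score, the following is a minimal sketch and not the authors' implementation. It assumes each candidate open reading frame has already been reduced to a numeric feature vector; the two features used here (a splice site score and a coding potential score), all function names, and all data values are hypothetical and chosen only to make the example runnable. The weights are computed as w = S_w^{-1}(m_exon - m_pseudo), where S_w is the within-class scatter matrix, and the decision threshold is placed at the midpoint of the two projected class means, which is one common choice among several.

```python
from typing import List, Sequence, Tuple


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def within_class_scatter(groups, means) -> List[List[float]]:
    """Sum over classes of the outer products of deviations from the class mean."""
    dim = len(means[0])
    scatter = [[0.0] * dim for _ in range(dim)]
    for rows, mu in zip(groups, means):
        for row in rows:
            dev = [x - m for x, m in zip(row, mu)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += dev[i] * dev[j]
    return scatter


def solve(matrix: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_fisher(positives, negatives) -> Tuple[List[float], float]:
    """Return (weights, threshold) of a two-class linear Fisher discriminant."""
    mu_pos, mu_neg = mean_vector(positives), mean_vector(negatives)
    scatter = within_class_scatter([positives, negatives], [mu_pos, mu_neg])
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    weights = solve(scatter, diff)
    # Threshold: projection of the midpoint between the two class means.
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    threshold = sum(w * m for w, m in zip(weights, midpoint))
    return weights, threshold


def is_exon(weights, threshold, features) -> bool:
    """Classify a candidate as an exon if its projected score exceeds the threshold."""
    return sum(w * x for w, x in zip(weights, features)) > threshold


# Hypothetical training data: (splice site score, coding potential score).
exons = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.8, 0.6)]
pseudoexons = [(0.2, 0.3), (0.3, 0.2), (0.1, 0.4), (0.4, 0.3)]

weights, threshold = fit_fisher(exons, pseudoexons)
print(is_exon(weights, threshold, (0.85, 0.75)))
print(is_exon(weights, threshold, (0.30, 0.35)))
```

The same functions accept feature vectors of any length, so the further characteristics named in the abstract (for example, intron region composition) could be appended as extra columns, provided the resulting scatter matrix remains non-singular.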