Solovyev V V, Lawrence C B
Department of Cell Biology, Baylor College of Medicine, Houston, TX 77030, USA.
Proc Int Conf Intell Syst Mol Biol. 1993;1:371-9.
Accurate recognition of coding and intron regions within large stretches of uncharacterized genomic DNA is an unsolved problem. A database of more than 4,240,791 bp of coding and 7,790,682 bp of noncoding human sequence was extracted from GenBank to develop a function for locating coding regions in anonymous sequences. Several coding measures based on oligonucleotide preferences were tested on a control set that included 1/3 of all extracted sequences. The accuracy of separating coding from noncoding regions is 87% for 9 bp oligonucleotides on 54 bp windows and 91% on 108 bp windows. For separating coding regions from introns, the accuracy is 89-90% for 8 bp oligonucleotides on 54 bp windows and up to 95% on 108 bp windows. A joint splice site prediction scheme was developed that combines the octanucleotide preferences of protein-coding and intron regions with the frequencies of significant triplets as a function of position near splice junctions. The accuracy of the joint scheme for predicting splice site positions on the test set was about 96-97%, which exceeds the accuracy of the previously reported splice site selection method based on a more complex artificial neural network approach. A model of splicing that uses poly-G(C)-rich exon-flanking sequences is suggested. A remarkable difference in oligonucleotide composition between the 5' and 3' regions of genes is shown and applied in a gene structure prediction system.
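The abstract does not give the exact form of the coding function, so the following is only a minimal sketch of one common way to turn oligonucleotide preferences into a windowed coding measure: a smoothed log-odds score per k-mer, averaged over a fixed-length window. The function names, the pseudocount, and the use of a log ratio are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log_odds_table(coding_seqs, noncoding_seqs, k, pseudocount=1.0):
    """Build a k-mer -> log(P_coding / P_noncoding) table.

    Assumption: add-pseudocount smoothing over all 4**k possible k-mers.
    Returns the table and the score to use for k-mers never seen in training.
    """
    cod = kmer_counts(coding_seqs, k)
    non = kmer_counts(noncoding_seqs, k)
    n_kmers = 4 ** k
    cod_total = sum(cod.values()) + pseudocount * n_kmers
    non_total = sum(non.values()) + pseudocount * n_kmers
    table = {}
    for kmer in set(cod) | set(non):
        p_cod = (cod[kmer] + pseudocount) / cod_total
        p_non = (non[kmer] + pseudocount) / non_total
        table[kmer] = math.log(p_cod / p_non)
    # An unseen k-mer has only the pseudocount in both classes.
    default = math.log(non_total / cod_total)
    return table, default


def score_window(window, table, default, k):
    """Mean log-odds over all overlapping k-mers in one window."""
    window = window.upper()
    n = len(window) - k + 1
    if n <= 0:
        return 0.0
    total = sum(table.get(window[i:i + k], default) for i in range(n))
    return total / n


def window_scores(seq, table, default, k, window=54, step=3):
    """Slide a fixed-length window along seq; return (start, score) pairs."""
    return [
        (start, score_window(seq[start:start + window], table, default, k))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Under these assumptions, a window with a positive score would be called coding and one with a negative score noncoding. With k=9 and window=54 or window=108 the sketch mirrors the window and oligonucleotide sizes quoted in the abstract, although the reported accuracies depend on the authors' own measure and training data.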