Solovyev V V, Salamov A A, Lawrence C B
Department of Cell Biology, Baylor College of Medicine, Houston, TX 77030, USA.
Proc Int Conf Intell Syst Mol Biol. 1995;3:367-75.
The development of advanced techniques to identify gene structure is one of the main challenges of the Human Genome Project. Discriminant analysis was applied to the construction of recognition functions for various components of gene structure. Linear discriminant functions for splice site, 5'-coding region, internal exon, and 3'-coding region recognition have been developed. A gene structure prediction system, FGENE, has been developed based on the exon recognition functions. We compute a graph of the mutual compatibility of different exons and represent gene structure models as paths of this directed acyclic graph. To select an optimal model, we apply a variant of the dynamic programming algorithm to search for the path in the graph with the maximal value of the corresponding discriminant functions. Prediction by FGENE for 185 complete human gene sequences has 81% exact exon recognition accuracy and 91% accuracy at the level of individual exon nucleotides, with a correlation coefficient (C) of 0.90. Testing FGENE on 35 genes not used in the development of the discriminant functions shows 71% accuracy of exact exon prediction and 89% accuracy at the nucleotide level (C = 0.86). FGENE compares very favorably with the other programs currently used to predict protein-coding regions. Analysis of uncharacterized human sequences based on our methods for splice site (HSPL, RNASPL), internal exon (HEXON), and all types of exon (FEXH) recognition, human (FGENEH) and bacterial (CDSB) gene structure prediction, and recognition of human and bacterial sequences (HBR) (to test a library for E. coli contamination) is available through the University of Houston, Weizmann Institute of Science network server and a WWW page of the Human Genome Center at Baylor College of Medicine.
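The path search described in the abstract can be illustrated with a minimal sketch. This is not the FGENE implementation: the `Exon` type, the `compatible` rule (an exon may follow another only if it starts after the other ends), and the example scores are all hypothetical simplifications. The abstract does not specify the actual compatibility criteria or how discriminant values are combined along a path, so the sketch simply sums per-exon scores.

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # first nucleotide position (inclusive)
    end: int      # last nucleotide position (inclusive)
    score: float  # hypothetical discriminant-function value for this exon


def compatible(a: Exon, b: Exon) -> bool:
    """Simplified compatibility rule (an assumption): b starts after a ends."""
    return b.start > a.end


def best_gene_model(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximal-score path through the exon compatibility DAG."""
    if not exons:
        return 0.0, []
    # Sorting by start yields a topological order, because every edge
    # a -> b requires b.start > a.end >= a.start.
    order = sorted(exons, key=lambda e: e.start)
    best = [e.score for e in order]  # best path score ending at each exon
    prev = [-1] * len(order)         # back-pointer to the chosen predecessor
    for j, b in enumerate(order):
        for i in range(j):
            a = order[i]
            if compatible(a, b) and best[i] + b.score > best[j]:
                best[j] = best[i] + b.score
                prev[j] = i
    # Trace back from the exon that ends the highest-scoring path.
    k = max(range(len(order)), key=best.__getitem__)
    total = best[k]
    path: List[Exon] = []
    while k != -1:
        path.append(order[k])
        k = prev[k]
    path.reverse()
    return total, path


# Hypothetical candidate exons with made-up scores.
candidates = [
    Exon(10, 50, 2.0),
    Exon(40, 90, 3.5),
    Exon(100, 150, 1.5),
    Exon(120, 200, 2.5),
    Exon(210, 260, 1.0),
]
total, model = best_gene_model(candidates)
print(total, [(e.start, e.end) for e in model])
```

For these made-up candidates the selected model is the chain of exons at 40-90, 120-200 and 210-260, with a summed score of 7.0. The quadratic loop over predecessors keeps the sketch short; a realistic compatibility test would presumably also consider reading frame and exon type (5', internal, 3'), which are omitted here.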