Korenberg Michael J, Lipson Edward D, Green James R, Solomon Jerry E
Department of Electrical and Computer Engineering, Queen's University, Kingston, Ontario, Canada.
Ann Biomed Eng. 2002 Jan;30(1):129-40. doi: 10.1114/1.1433490.
Many current procedures for detecting coding regions in human DNA sequences combine several individual techniques, such as discriminant analysis and neural network methods. Recent papers have used techniques from nonlinear system identification, in particular parallel cascade identification (PCI), to classify protein sequences into their structure/function groups. In the present paper, PCI is used in a pilot study to distinguish exon (coding) from intron (noncoding; interspersed within genes) human DNA sequences. Only the first exon and first intron sequences with known boundaries in genomic DNA from the beta T-cell receptor locus were used for training. The resulting parallel cascade classifiers achieved classification rates of about 89% on novel sequences in a test set, and averaged about 82% when results of a blind test were included. In testing over a much wider range of human nucleotide sequences, PCI classifiers averaged 83.6% correct classifications. These results indicate that parallel cascade classifiers may be useful components in future coding region detection programs.
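The abstract names parallel cascade identification without describing it. As a rough illustration only, the sketch below fits a sum of cascades, each a linear filter followed by a polynomial, to a signal by repeatedly modelling the remaining residual. Everything specific here is an assumption and not taken from the paper: the function names, the use of the first-order input/residual cross-correlation to choose each filter, the memory length, polynomial degree, number of cascades, and the synthetic toy signal. Applying such a model to exon/intron classification would also need a numeric encoding of the nucleotides and a decision rule on the model output, neither of which the abstract specifies.

```python
import numpy as np

def linear_output(x, h):
    """Causal FIR filtering of x with impulse response h."""
    return np.convolve(x, h)[:len(x)]

def fit_cascade(x, residual, memory, degree):
    """Fit one linear-filter -> polynomial cascade to the current residual."""
    n = len(x)
    # Assumed choice: impulse response from the input/residual cross-correlation.
    h = np.array([np.dot(residual[lag:], x[:n - lag]) / (n - lag)
                  for lag in range(memory)])
    if not np.any(h):
        h[0] = 1.0  # fall back to a pure pass-through filter
    u = linear_output(x, h)
    coeffs = np.polyfit(u, residual, degree)  # least-squares static nonlinearity
    return h, coeffs

def fit_parallel_cascades(x, y, n_cascades=3, memory=4, degree=2):
    """Add cascades one at a time, each fitted to what the others left over."""
    residual = np.asarray(y, dtype=float).copy()
    model = []
    for _ in range(n_cascades):
        h, coeffs = fit_cascade(x, residual, memory, degree)
        residual = residual - np.polyval(coeffs, linear_output(x, h))
        model.append((h, coeffs))
    return model

def predict(x, model):
    """Sum the outputs of all cascades."""
    return sum(np.polyval(c, linear_output(x, h)) for h, c in model)

# Synthetic toy system (not data from the paper): filter, then a quadratic.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
u_true = linear_output(x, np.array([1.0, 0.5]))
y = u_true + 0.3 * u_true ** 2

model = fit_parallel_cascades(x, y)
mse = float(np.mean((y - predict(x, model)) ** 2))
print(f"residual MSE: {mse:.4f}, signal variance: {np.var(y):.4f}")
```

Because each polynomial is fitted by least squares and includes a constant term, adding a cascade cannot increase the training mean-square error in this sketch.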