Müller H-M, Koonin S E
Division of Biology and W. K. Kellogg Radiation Laboratory, California Institute of Technology, 1201 East California Boulevard, Pasadena, CA 91125, USA.
J Theor Biol. 2003 Jul 21;223(2):161-9. doi: 10.1016/s0022-5193(03)00082-1.
Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
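The pipeline described in the abstract (word-content vectors, then a PCA that defines one Gaussian class per sequence type) can be illustrated with a minimal sketch. This is not the authors' implementation: the word length `k`, the use of overlapping words, the variance floor, and the names `word_vector`, `GaussianClass`, and `classify` are all assumptions chosen for illustration, and NumPy is assumed to be available.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def word_vector(seq, k=2):
    """Relative frequencies of all 4**k overlapping k-letter words in seq.

    Assumption: overlapping words and k=2 are illustrative choices only.
    Words containing letters outside ACGT are skipped.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = np.zeros(len(words))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


class GaussianClass:
    """A Gaussian sequence class defined by a PCA of training word vectors."""

    def __init__(self, vectors, floor=1e-6):
        X = np.asarray(vectors, dtype=float)
        self.mean = X.mean(axis=0)
        # PCA: eigen-decomposition of the sample covariance matrix.
        evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
        self.axes = evecs.T  # rows are principal axes
        # Hypothetical variance floor so near-zero directions stay invertible.
        self.var = np.maximum(evals, floor)

    def log_likelihood(self, v):
        """Log-density of v under the class Gaussian, in PCA coordinates."""
        z = self.axes @ (v - self.mean)
        return -0.5 * np.sum(z ** 2 / self.var + np.log(2 * np.pi * self.var))


def classify(seq, classes, k=2):
    """Return the name of the class with the highest log-likelihood."""
    v = word_vector(seq, k)
    return max(classes, key=lambda name: classes[name].log_likelihood(v))
```

In use, one `GaussianClass` would be fitted per labelled training set (for example one for introns and one for exons), and `classify` would assign a new sequence to whichever class gives it the higher likelihood. Because each class carries its own covariance, the decision depends on both the mean word content and how word usage varies within the class, which is the distinction the abstract draws.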