Liew Alan Wee-Chung, Wu Yonghui, Yan Hong, Yang Mengsu
Int J Bioinform Res Appl. 2005;1(2):181-201. doi: 10.1504/IJBRA.2005.007577.
This study quantitatively evaluates the information content of different coding features for classifying coding and non-coding regions in three species. The results indicate that coding features that are effective for yeast or C. elegans are generally not very effective for human, which has a short average exon length. Through a correlation analysis, we identified a subset of human coding features that have high discriminative power yet are complementary in their information content. With this subset, a simple kNN classifier achieved a classification accuracy of up to 90%.
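The two steps named in the abstract, choosing discriminative but weakly correlated features and then classifying with kNN, can be illustrated with a minimal sketch. It is not the authors' implementation: the feature values, the per-feature scores, the correlation threshold `max_corr`, and `k` are all hypothetical, since the abstract does not specify them.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_complementary(columns, scores, max_corr=0.8):
    """Greedy selection: visit features from highest to lowest score and keep
    one only if its |correlation| with every kept feature is <= max_corr."""
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[i])) <= max_corr for i in kept):
            kept.append(j)
    return kept

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical toy data: each row holds three made-up feature values for one region.
X = [
    [0.9, 1.8, 0.2],
    [0.8, 1.6, 0.9],
    [0.7, 1.4, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.1],
    [0.3, 0.6, 0.7],
]
y = ["coding", "coding", "coding", "noncoding", "noncoding", "noncoding"]

# Hypothetical discriminative-power score per feature (assumed, not from the paper).
scores = [0.9, 0.85, 0.4]

columns = [list(col) for col in zip(*X)]      # one list of values per feature
kept = select_complementary(columns, scores)  # feature 1 duplicates feature 0
train = [[row[j] for j in kept] for row in X] # project rows onto kept features

print(kept)
print(knn_predict(train, y, [0.85, 0.5]))
print(knn_predict(train, y, [0.15, 0.5]))
```

In this toy data the second feature is an exact multiple of the first, so the greedy pass drops it as redundant and keeps the third, lower-scoring but nearly uncorrelated feature. Real coding features would first need to be computed from sequence windows and scaled to comparable ranges before a distance-based classifier is applied.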