Seo Hyein, Song Yong-Joon, Cho Kiho, Cho Dong-Ho
School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 300-010, South Korea.
Department of Surgery, University of California, Sacramento, California 95064, USA.
IEEE Open J Eng Med Biol. 2020 Jul 14;1:214-219. doi: 10.1109/OJEMB.2020.3009055. eCollection 2020.
An organism's individual characteristics are determined by its genome, a complex combination of bases. This combination is reflected in the k-word profile, which records how often each word of k consecutive bases occurs. Analyzing the genome-specific statistical properties of the k-word profile is therefore important for understanding the characteristics of a genome. In this paper, we propose a new k-word-based method for analyzing genome-specific properties. We define k-words that contain the same number of each base as statistically identical k-words. Under statistical prediction, statistically identical k-words are expected to appear at similar frequencies. This may not hold in a real genome, however, because a genome is not a random list of bases. The ratio between the frequencies of two statistically identical k-words can therefore be used to investigate the statistical specificity of the genome reflected in its k-word profile. To find the ratios that best represent genomic characteristics, we calculate for each ratio a reference value that yields the minimum error when data are classified by that ratio alone. Finally, we propose a genetic-algorithm-based search to select a minimal set of ratios useful for classification. The proposed method was applied to full-length microbial sequences for pathogenicity classification. Its classification accuracy was similar to that of conventional methods while using only a few features. The proposed method for investigating genome-specific statistical specificity in the k-word profile can be applied to find important properties of a genome and to classify genome sequences.
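The first steps described in the abstract (building a k-word profile, grouping statistically identical k-words, taking frequency ratios, and choosing a per-ratio reference value) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "statistically identical" means identical base composition, that k-words are counted with overlap, and that the reference value is a simple threshold midway between neighboring observed ratios. All function names are hypothetical, and the genetic-algorithm search is not covered.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kword_profile(seq, k):
    """Count every overlapping k-word in seq, keeping only pure A/C/G/T words."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter({w: c for w, c in counts.items() if set(w) <= set("ACGT")})


def composition(word):
    """Base composition (#A, #C, #G, #T): the assumed equivalence key."""
    return tuple(word.count(base) for base in "ACGT")


def identical_groups(profile):
    """Group observed k-words that share a base composition (assumption)."""
    groups = defaultdict(list)
    for word in profile:
        groups[composition(word)].append(word)
    # A group needs at least two words to yield a ratio.
    return {comp: sorted(ws) for comp, ws in groups.items() if len(ws) > 1}


def frequency_ratios(profile):
    """Count ratio for each pair of statistically identical k-words.

    Only observed words are paired, so no denominator is zero.
    """
    ratios = {}
    for words in identical_groups(profile).values():
        for a, b in combinations(words, 2):
            ratios[(a, b)] = profile[a] / profile[b]
    return ratios


def best_reference_value(values, labels):
    """Find the threshold on one ratio that minimizes classification errors.

    values: the ratio measured in each genome; labels: 0/1 class per genome.
    Either side of the threshold may be class 1, whichever errs less.
    Returns (threshold, errors), or None if all values are equal.
    """
    points = sorted(set(values))
    best = None
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        errors = min(errors, len(values) - errors)  # allow flipped polarity
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


if __name__ == "__main__":
    profile = kword_profile("ACACACG", 2)
    print(frequency_ratios(profile))
    print(best_reference_value([1, 2, 3, 4], [0, 0, 1, 1]))
```

In the toy sequence `ACACACG`, the 2-words `AC` and `CA` share a base composition but occur 3 and 2 times, so their ratio departs from the value of 1 expected for a random sequence. Collecting such ratios across many labeled genomes and passing each one to `best_reference_value` gives a per-ratio error that could be used to rank candidate features.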