Mohamed Hashim Ezzeddin Kamil, Abdullah Rosni
School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia.
School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia; National Advanced IPv6 Centre of Excellence (NAv6), School of Computer Sciences Building, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia.
J Theor Biol. 2015 Dec 21;387:88-100. doi: 10.1016/j.jtbi.2015.09.014. Epub 2015 Sep 30.
Empirical analysis of DNA k-mers has proven to be an effective tool for finding unique patterns in DNA sequences, which can lead to the discovery of potential sequence motifs. In an extensive empirical study of k-mers across hundreds of organisms, researchers found that unique multi-modal k-mer spectra occur only in the genomes of organisms from the tetrapod clade, which includes all mammals. The multi-modality is caused by the formation of the two lowest modes, and the k-mers that fall under these modes are referred to as rare k-mers. The suppression of the k-mers in the two lowest modes can be attributed to the CG dinucleotides they contain. In addition, the rare k-mers are selectively distributed in certain genomic features: CpG islands (CGIs), promoters, 5' UTRs, and exons. We correlated the rare k-mers with hundreds of annotated features using several bioinformatic tools, performed further intrinsic rare k-mer analyses within the correlated features, and modeled the observed rare k-mer clustering property as a classifier to predict the correlated CGI and promoter features. Our correlation results show that rare k-mers are highly associated with several annotated features: CGIs, promoters, 5' UTRs, and open chromatin regions. Our intrinsic results show that rare k-mers have several unique topological, compositional, and clustering properties in CGI and promoter features. Finally, our RWC (rare-word clustering) method ranks among the top three in eight of the CGI and promoter evaluations, across eight of the benchmarked datasets.
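To make the terms in the abstract concrete, the following is a minimal Python sketch of a k-mer spectrum and of a windowed rare k-mer density. It is an illustration only and is not the authors' RWC method: the function names (`count_kmers`, `kmer_spectrum`, `rare_kmer_density`), the window and step sizes, and the way the rare set is supplied are all assumptions made for this example. In practice the rare set would be derived from the lowest modes of a genome-wide spectrum rather than chosen by hand.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_kmers(seq, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def kmer_spectrum(counts, k):
    """Map an occurrence count to the number of distinct k-mers with that count.

    All 4**k possible k-mers are considered, so unseen k-mers land in bin 0.
    A multi-modal spectrum shows up as several peaks in this histogram.
    """
    spectrum = Counter()
    for kmer in map("".join, product(ALPHABET, repeat=k)):
        spectrum[counts.get(kmer, 0)] += 1
    return spectrum


def rare_kmer_density(seq, rare, k, window, step):
    """Fraction of k-mer positions in each window that hold a rare k-mer.

    `rare` is a set of k-mers assumed to be rare (hypothetical input here).
    Returns a list of (window_start, fraction) pairs; windows with a high
    fraction are candidate clusters of rare k-mers.
    """
    seq = seq.upper()
    positions = window - k + 1  # k-mer start positions per window
    if positions <= 0:
        raise ValueError("window must be at least k long")
    densities = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        hits = sum(chunk[i:i + k] in rare for i in range(positions))
        densities.append((start, hits / positions))
    return densities


if __name__ == "__main__":
    toy = "ACGTACGT"                      # toy sequence, not real genomic data
    counts = count_kmers(toy, 2)
    print(kmer_spectrum(counts, 2))
    print(rare_kmer_density(toy, {"CG"}, k=2, window=4, step=4))
```

A classifier in the spirit of the abstract could threshold these window densities to flag candidate CGI or promoter regions, but the threshold and window length would need to be tuned against annotated data, which this sketch does not attempt.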