Lee Sangseon, Lee Taeheon, Noh Yung-Kyun, Kim Sun
IEEE/ACM Trans Comput Biol Bioinform. 2021 May-Jun;18(3):1174-1183. doi: 10.1109/TCBB.2019.2938949. Epub 2021 Jun 3.
Existing k-mer based string kernel methods have been used successfully for sequence comparison. However, they have limitations for comparative and evolutionary analyses of genomes because they are sensitive to over-represented k-mers and to variable sequence lengths.
In this study, we propose a novel ranked k-spectrum string (RKSS) kernel. 1) The RKSS kernel utilizes sets of k-mers that are common across species, named landmarks, which can be used for comparing multiple genomes. 2) Based on the landmarks, we can use ranks of k-mers rather than frequencies, which produces more robust distances between genomes. To show the power of the RKSS kernel, we conducted two experiments using 10 mammalian species with exon, intron, and CpG island sequences. The RKSS kernel reconstructed more consistent evolutionary trees than the k-spectrum string kernel. In the subsequent experiment, for each sequence, kernel distances were calculated from 30 landmarks representing the exon, intron, and CpG island sequences of the 10 genomes. Based on the kernel distances, concordance tests were performed, and the results suggested that more information is conserved across species in CpG islands than in introns. In conclusion, our analysis suggests the relational order exon > CpG island > intron in terms of evolutionary information content.
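The idea of replacing k-mer frequencies with ranks over a shared landmark can be illustrated with a minimal Python sketch. This is not the RKSS kernel as defined in the paper: the abstract does not give the kernel formula, so the landmark construction (k-mers present in every sequence), the tie-breaking rule, and the Euclidean distance on rank vectors are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def landmark(count_tables):
    """Assumed landmark: the k-mers that occur in every count table."""
    common = set(count_tables[0])
    for table in count_tables[1:]:
        common &= set(table)
    return sorted(common)


def rank_vector(counts, landmark_kmers):
    """Rank landmark k-mers by count (1 = most frequent).

    Ties are broken alphabetically; this is an illustrative choice only.
    """
    ordered = sorted(landmark_kmers, key=lambda m: (-counts[m], m))
    rank_of = {m: r for r, m in enumerate(ordered, start=1)}
    return [rank_of[m] for m in landmark_kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_distance(seq_a, seq_b, k):
    """Distance between two sequences using ranks over their shared k-mers."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    lm = landmark([ca, cb])
    return euclidean(rank_vector(ca, lm), rank_vector(cb, lm))


def count_distance(seq_a, seq_b, k):
    """Same comparison using raw counts, for contrast with rank_distance."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    lm = landmark([ca, cb])
    return euclidean([ca[m] for m in lm], [cb[m] for m in lm])
```

In this sketch, a k-mer whose count rises from 5 to 500 keeps the same rank as long as it remains the most frequent landmark k-mer, whereas its raw count changes by two orders of magnitude. That is the intuition behind the robustness to over-represented k-mers and to sequence length described above.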