College of Computer Science, Sichuan University, Chengdu, China.
School of Computer and Information Science, Southwest University, Chongqing, China.
Bioinformatics. 2018 Nov 1;34(21):3624-3630. doi: 10.1093/bioinformatics/bty392.
This study addresses several important questions related to naturally underrepresented sequences: (i) Are there permutations of real genomic DNA sequences of a defined length (k-mer) in a given lineage that do not actually exist or are underrepresented? (ii) If there are such sequences, what are their characteristics in terms of k-mer length and base composition? (iii) Are they related to the CpG or TpA underrepresentation known for human sequences? We propose that the answers to these questions are of great significance for the study of sequence-associated regulatory mechanisms, such as cytosine methylation and chromosomal structures, in physiological or pathological conditions such as cancer.
We empirically defined sequences that were not included in any well-known public databases as lineage-associated underrepresented permutations (LAUPs). We then developed a Jellyfish-based LAUP analysis application (JBLA) to investigate LAUPs in 24 representative species. The present discoveries include: (i) The shortest LAUPs range in length from 10 to 14 and collectively constitute a low proportion of the genome. (ii) Common LAUPs show higher CG content over the analysed mammalian genome and possess distinct CG*CG motifs. (iii) Neither CpG-containing LAUPs nor CpG island sequences are randomly structured and distributed over the genomes; some LAUPs and most CpG-containing sequences exhibit opposite trends within the same k and n variants. In addition, we demonstrate that the JBLA algorithm is more efficient than the original Jellyfish for computing LAUPs.
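The notion of an absent k-mer can be illustrated with a minimal Python sketch. This is not the JBLA implementation and does not use Jellyfish: it enumerates all 4^k permutations in memory, which is only practical for small k. The function names (`absent_kmers`, `shortest_absent_length`, `cg_content`) are hypothetical, and scanning a single strand without reverse complements is an assumption made here for simplicity.

```python
from itertools import product


def absent_kmers(sequence, k):
    """Return, in lexicographic order, the k-mers over ACGT absent from `sequence`.

    Assumption: only the given strand is scanned. Windows containing other
    symbols (for example N) never match an ACGT k-mer, so they are ignored.
    """
    seq = sequence.upper()
    observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in all_kmers if kmer not in observed]


def shortest_absent_length(sequence, max_k=14):
    """Return the smallest k <= max_k with at least one absent k-mer, else None."""
    for k in range(1, max_k + 1):
        if absent_kmers(sequence, k):
            return k
    return None


def cg_content(kmers):
    """Return the fraction of C and G bases across a collection of k-mers."""
    total = sum(len(kmer) for kmer in kmers)
    if total == 0:
        return 0.0
    cg = sum(kmer.count("C") + kmer.count("G") for kmer in kmers)
    return cg / total


if __name__ == "__main__":
    toy = "ACGTACGTTTGACA"  # toy input, not real genomic data
    k = shortest_absent_length(toy)
    missing = absent_kmers(toy, k)
    print(k, len(missing), round(cg_content(missing), 3))
```

For genome-scale input, a dedicated k-mer counter such as Jellyfish would replace the in-memory set used here; the sketch only shows the definition being applied.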
We developed a Jellyfish-based LAUP analysis (JBLA) application by integrating the Jellyfish (Marçais and Kingsford, 2011), MEME (Bailey et al., 2009) and NCBI genome database (Pruitt et al., 2007) applications, which are listed in the Supplementary Material.
Supplementary data are available at Bioinformatics online.