Suzuki Yuta, Korlach Jonas, Turner Stephen W, Tsukahara Tatsuya, Taniguchi Junko, Qu Wei, Ichikawa Kazuki, Yoshimura Jun, Yurino Hideaki, Takahashi Yuji, Mitsui Jun, Ishiura Hiroyuki, Tsuji Shoji, Takeda Hiroyuki, Morishita Shinichi
Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8583, Japan.
Pacific Biosciences, Menlo Park, CA 94025, USA.
Bioinformatics. 2016 Oct 1;32(19):2911-9. doi: 10.1093/bioinformatics/btw360. Epub 2016 Jun 17.
Determining the methylation state of regions with high copy numbers is challenging for second-generation sequencing, because the reads are too short to be mapped uniquely, especially when repetitive regions are long and nearly identical to each other. Single-molecule real-time (SMRT) sequencing is a promising method for observing such regions, because it is not vulnerable to GC bias, it produces long reads, and its kinetic information is sensitive to DNA modifications.
We propose a novel linear-time algorithm that combines the kinetic information for neighboring CpG sites and increases the confidence in identifying the methylation states of those sites. Using a practical read coverage of ∼30-fold from an inbred strain of medaka (Oryzias latipes), we observed that both the sensitivity and precision of our method on individual CpG sites were ∼93.7%. We also observed a high correlation coefficient (R = 0.884) between our method and bisulfite sequencing; for 92.0% of CpG sites, the methylation levels, which range over [0,1], agreed to within an acceptable difference of 0.25. Using this method, we characterized the landscape of the methylation status of repetitive elements, such as LINEs, in the human genome, thereby revealing the strong correlation between CpG density and hypomethylation and detecting hypomethylation hot spots of LTRs and LINEs. We uncovered the methylation states for nearly identical active transposons, namely two novel LINE insertions of identity ∼99% and length 6050 base pairs (bp) in the human genome, and 16 Tol2 elements of identity >99.8% and length 4682 bp in the medaka genome.
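The idea of pooling evidence across neighboring CpG sites in linear time can be illustrated with a small dynamic program. The following is a minimal sketch under stated assumptions, not the AgIn implementation: the function name `segment_labels`, the per-site `scores` (hypothetical values where a positive score favors label 1 and a negative score favors label 0) and the `min_sites` constraint are all introduced here for illustration. The sketch assigns one of two labels to each site so that every run of equal labels spans at least `min_sites` sites and the summed agreement with the scores is maximal.

```python
from itertools import accumulate


def segment_labels(scores, min_sites):
    """Label each site 0 or 1 so that every run of equal labels has at
    least `min_sites` sites and the total agreement is maximal.

    Illustrative only: `scores` are hypothetical per-site values in which
    a positive score favors label 1 and a negative score favors label 0.
    Runs in O(n) time and space for n sites.
    """
    n = len(scores)
    if n == 0:
        return []
    prefix = [0.0] + list(accumulate(scores))  # prefix[i] = sum(scores[:i])
    if n < min_sites:
        # Too few sites for the constraint: treat them as one interval.
        return [1 if prefix[n] > 0 else 0] * n

    neg_inf = float("-inf")
    # best[i][c]: best total for the first i sites when the last run has
    # label c and already contains at least min_sites sites.
    best = [[neg_inf, neg_inf] for _ in range(n + 1)]
    from_new = [[False, False] for _ in range(n + 1)]
    best[0] = [0.0, 0.0]

    for i in range(min_sites, n + 1):
        for c in (0, 1):
            sign = 1.0 if c == 1 else -1.0
            # Option A: open a new run of exactly min_sites sites ending at i.
            window = prefix[i] - prefix[i - min_sites]
            new = best[i - min_sites][1 - c] + sign * window
            # Option B: extend an existing run of label c by one site.
            ext = neg_inf
            if i > min_sites:
                ext = best[i - 1][c] + sign * scores[i - 1]
            if new >= ext:
                best[i][c], from_new[i][c] = new, True
            else:
                best[i][c] = ext

    # Trace back from the better final label.
    labels = [0] * n
    c = 0 if best[n][0] >= best[n][1] else 1
    i = n
    while i > 0:
        if from_new[i][c]:
            labels[i - min_sites:i] = [c] * min_sites
            i -= min_sites
            c = 1 - c
        else:
            labels[i - 1] = c
            i -= 1
    return labels


if __name__ == "__main__":
    # Two isolated sites (indices 2 and 6) disagree with their neighbors;
    # the minimum-run constraint smooths them out.
    example = [2.0, 1.5, -0.2, 1.8, -3.0, -2.5, 0.4, -2.2]
    print(segment_labels(example, min_sites=3))
```

In this toy input, classifying each site by the sign of its own score would flip the labels at indices 2 and 6, whereas requiring at least three sites per run lets the stronger neighboring evidence override those weak outliers. How the per-site scores are derived from the kinetic signal, and how the minimum run length is chosen, is left open here.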
AgIn (Aggregate on Intervals) is available at: https://github.com/hacone/AgIn
ysuzuki@cb.k.u-tokyo.ac.jp or moris@cb.k.u-tokyo.ac.jp
Supplementary data are available at Bioinformatics online.