Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
BMC Genomics. 2011 Jul 25;12:375. doi: 10.1186/1471-2164-12-375.
Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection.
We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then, to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring events gapped in the center of a read more highly). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs.
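The calibration step can be illustrated with a minimal sketch. It assumes the correction takes the common form: corrected count = observed calls × PPV / sensitivity, applied separately to each event class. The abstract does not give SRiC's exact formula, so this form, the class name `Calibration`, the function names, and all numbers below are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class Calibration:
    """Outcome of calling SVs on simulated reads for one event class (hypothetical)."""
    true_positives: int   # simulated events that were correctly called
    false_positives: int  # calls that match no simulated event
    false_negatives: int  # simulated events that were missed

    @property
    def sensitivity(self) -> float:
        # Fraction of real (simulated) events that the caller recovers.
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def ppv(self) -> float:
        # Positive predictive value: fraction of calls that are real events.
        return self.true_positives / (self.true_positives + self.false_positives)


def corrected_count(observed: int, cal: Calibration) -> float:
    """Assumed correction: discount false calls (PPV), then add back missed events."""
    if cal.true_positives == 0:
        raise ValueError("calibration has no true positives; cannot correct")
    return observed * cal.ppv / cal.sensitivity


def corrected_total(observed_by_class: dict, cal_by_class: dict) -> float:
    """Sum the corrected counts over all event classes (e.g. length bins)."""
    return sum(
        corrected_count(observed, cal_by_class[name])
        for name, observed in observed_by_class.items()
    )


# Illustrative numbers only, not taken from the paper.
calibrations = {
    "short_insertion": Calibration(true_positives=60, false_positives=0, false_negatives=40),
    "long_deletion": Calibration(true_positives=80, false_positives=20, false_negatives=20),
}
observed = {"short_insertion": 60, "long_deletion": 100}
estimate = corrected_total(observed, calibrations)
```

Correcting each class separately matters because a caller that misses more insertions than deletions would otherwise carry that bias into the total.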
Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.