Department of Computer Science, Brown University, Providence, RI 02912, USA.
Bioinformatics. 2010 May 15;26(10):1291-8. doi: 10.1093/bioinformatics/btq153. Epub 2010 Apr 8.
Structural variation, including deletions, duplications and rearrangements of DNA sequence, is an important contributor to genome variation in many organisms. In humans, many structural variants are found in complex and highly repetitive regions of the genome, making their identification difficult. A new sequencing technology called strobe sequencing generates strobe reads containing multiple subreads from a single contiguous fragment of DNA. Strobe reads thus generalize the concept of paired reads, or mate pairs, that have been routinely used for structural variant detection. Strobe sequencing holds promise for unraveling complex variants that have been difficult to characterize with current sequencing technologies.
We introduce an algorithm for identification of structural variants using strobe sequencing data. We consider strobe reads from a test genome that have multiple possible alignments to a reference genome due to sequencing errors and/or repetitive sequences in the reference. We formulate the combinatorial optimization problem of finding the minimum number of structural variants in the test genome that are consistent with these alignments. We solve this problem using an integer linear program. Using simulated strobe sequencing data, we show that our algorithm has better sensitivity and specificity than paired read approaches for structural variation identification.
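The optimization problem described above can be illustrated on a toy instance. The sketch below is not the integer linear program from the paper: it substitutes exhaustive search, which is only practical for a handful of reads, and it assumes a simplified model in which each candidate alignment of a strobe read is already annotated with the set of structural variants it would imply. The function name `min_variant_assignment`, the read identifiers and the variant labels are all hypothetical.

```python
from itertools import product


def min_variant_assignment(candidates):
    """Pick one alignment per read so that the fewest distinct variants are implied.

    candidates maps a read id to a list of candidate alignments, where each
    alignment is represented only by the set of variant labels it implies
    (an assumption made for this sketch).
    Returns (choice, variants): the chosen alignment index per read and the
    union of variants implied by that choice.
    """
    read_ids = sorted(candidates)
    best_choice, best_variants = None, None
    # Enumerate every combination of one alignment index per read.
    for combo in product(*(range(len(candidates[r])) for r in read_ids)):
        implied = set()
        for read_id, idx in zip(read_ids, combo):
            implied |= candidates[read_id][idx]
        # Keep the first combination with the smallest variant set.
        if best_variants is None or len(implied) < len(best_variants):
            best_choice = dict(zip(read_ids, combo))
            best_variants = implied
    return best_choice, best_variants


# Hypothetical toy data: three reads, each with two ambiguous alignments.
candidates = {
    "read1": [{"del_A"}, {"inv_B"}],
    "read2": [{"del_A"}, {"dup_C"}],
    "read3": [{"del_A"}, {"inv_B", "dup_C"}],
}

choice, variants = min_variant_assignment(candidates)
print(choice, variants)
```

In this toy instance every read has an alignment explained by the same deletion, so a single variant accounts for all three reads. The search is exponential in the number of reads, which is one reason a solver-based formulation such as an integer linear program is the more realistic choice for real data.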