Elyanow Rebecca, Wu Hsin-Ta, Raphael Benjamin J
Center for Computational Molecular Biology, Brown University, Providence, RI, USA.
Department of Computer Science, Princeton University, Princeton, NJ, USA.
Bioinformatics. 2018 Jan 15;34(2):353-360. doi: 10.1093/bioinformatics/btx712.
Structural variation, including large deletions, duplications, inversions, translocations and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (∼5 to 10) of DNA molecules, each ∼50 Kbp in length, with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants.
We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in an individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification, including two recent methods that also analyze linked reads, on simulated sequencing data and on 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes.
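To make the long-range signal concrete, the sketch below counts barcodes shared between two distant genomic windows. It illustrates only one kind of evidence that barcoded reads can provide for a novel adjacency; it is not NAIBR's probabilistic model, which the abstract does not specify. The read representation (chromosome, position, barcode), the function names, the window coordinates and the toy data are all assumptions made for illustration.

```python
# Illustrative only: a toy barcode-sharing count, NOT the NAIBR model.
# Reads are hypothetical (chromosome, position, barcode) tuples.

def barcodes_in_window(reads, chrom, start, end):
    """Return the set of barcodes with at least one read in [start, end)."""
    return {bc for (c, pos, bc) in reads if c == chrom and start <= pos < end}


def shared_barcodes(reads, window_a, window_b):
    """Return barcodes observed in both windows.

    Each window is a (chrom, start, end) tuple. Two windows that are far
    apart in the reference but share many barcodes may be adjacent in the
    sequenced genome, since reads with one barcode come from a few
    molecules of roughly 50 Kbp.
    """
    return barcodes_in_window(reads, *window_a) & barcodes_in_window(reads, *window_b)


# Toy data (invented): BC1 and BC4 have reads near both candidate breakpoints.
reads = [
    ("chr1", 100_000, "BC1"),
    ("chr1", 101_500, "BC1"),
    ("chr1", 5_000_200, "BC1"),
    ("chr1", 100_700, "BC2"),
    ("chr1", 5_001_000, "BC3"),
    ("chr1", 5_000_900, "BC4"),
    ("chr1", 100_300, "BC4"),
]

window_a = ("chr1", 95_000, 105_000)
window_b = ("chr1", 4_995_000, 5_005_000)
support = shared_barcodes(reads, window_a, window_b)
print(sorted(support))  # ['BC1', 'BC4']
```

A raw count like this would need to be compared against the sharing expected by chance, because each barcode labels several molecules; that is one reason to combine signals in a probabilistic model rather than apply a fixed threshold.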
Software is available at compbio.cs.brown.edu/software.
Supplementary data are available at Bioinformatics online.