Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, NY 10021, USA.
Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY 10021, USA.
Nucleic Acids Res. 2022 Oct 14;50(18):e108. doi: 10.1093/nar/gkac653.
Recent pan-genome studies have revealed an abundance of DNA sequences in human genomes that are not present in the reference genome. A large share of these non-reference sequences (NRSs) cannot be reliably assembled or placed on the reference genome. Improvements in long-read and synthetic long-read (also known as linked-read) technologies have great potential for the characterization of NRSs. While synthetic long reads require less input DNA than long-read datasets, they are algorithmically more challenging to use. Apart from computationally expensive whole-genome assembly methods, there is no synthetic long-read method for NRS detection. We propose a novel integrated alignment-based and local assembly-based algorithm, Novel-X, that uses the barcode information encoded in synthetic long reads to improve the detection of such events without a whole-genome de novo assembly. Our evaluations demonstrate that Novel-X finds many non-reference sequences that cannot be found by state-of-the-art short-read methods. We applied Novel-X to a diverse set of 68 samples from the Polaris HiSeq 4000 PGx cohort. Novel-X discovered 16 691 NRS insertions of size > 300 bp (total length 18.2 Mb). Many of them are population specific or may have a functional impact.
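The barcode idea mentioned in the abstract can be illustrated with a toy sketch. This is not the Novel-X implementation: the `Read` record, the `recruit_reads_by_barcode` function, and the 50 kb window are all hypothetical, chosen only to show how reads that share a barcode with reads aligned near a candidate site might be gathered for a local assembly.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    """Hypothetical minimal linked-read record (not a real file format)."""
    name: str
    barcode: str
    chrom: Optional[str]  # None when the read is unmapped
    pos: Optional[int]    # None when the read is unmapped


def recruit_reads_by_barcode(reads, chrom, site, window=50_000):
    """Return names of unmapped reads sharing a barcode with reads near `site`.

    Assumption: reads carrying the same barcode tend to originate from the
    same long DNA molecule, so unmapped reads with a "local" barcode are
    candidates for a local assembly around the site.
    """
    # Barcodes observed on reads aligned within `window` bp of the site.
    local_barcodes = {
        r.barcode
        for r in reads
        if r.chrom == chrom and r.pos is not None and abs(r.pos - site) <= window
    }
    # Unmapped reads that carry one of those barcodes.
    return sorted(
        r.name for r in reads if r.chrom is None and r.barcode in local_barcodes
    )


# Toy data: only BC01 has a read aligned near the candidate site.
reads = [
    Read("r1", "BC01", "chr1", 100_000),
    Read("r2", "BC01", None, None),
    Read("r3", "BC02", "chr1", 5_000_000),
    Read("r4", "BC02", None, None),
    Read("r5", "BC03", None, None),
]
recruited = recruit_reads_by_barcode(reads, "chr1", 100_500)
print(recruited)
```

In this toy input only `r2` is recruited, because it is unmapped and shares barcode `BC01` with a read aligned 500 bp from the site. A real pipeline would read barcodes from alignment records and would have to handle barcode collisions between unrelated molecules.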