Meleshko Dmitry, Yang Rui, Maharjan Salil, Danko David C, Korobeynikov Anton, Hajirasouliha Iman
Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, NY 10021, United States.
Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, NY 10021, United States.
Bioinform Adv. 2025 Jul 4;5(1):vbaf151. doi: 10.1093/bioadv/vbaf151. eCollection 2025.
Recent benchmarks show that most structural variations, especially those in the 50-10,000 bp range, cannot be resolved with short-read sequencing, whereas long-read structural variant callers perform better on the same datasets. However, high-coverage long-read sequencing is costly and requires substantial input DNA. Reducing coverage lowers cost but significantly degrades the performance of existing structural variation (SV) callers. Synthetic long-read technologies offer long-range information at lower cost, but leveraging them for SVs under 50 kbp remains challenging.
We propose Blackbird, a novel hybrid alignment- and local-assembly-based algorithm that uses synthetic long reads and low-coverage long reads to improve structural variant detection. Instead of relying on whole-genome assembly, Blackbird uses a sliding-window approach and synthetic long-read barcode information to assemble local segments, then integrates long reads to improve detection accuracy. We evaluated Blackbird on real human genome datasets. On the HG002 Genome in a Bottle (GIAB) benchmark, Blackbird in hybrid mode achieved results comparable to state-of-the-art long-read tools while using less long-read coverage. Blackbird requires only 5× coverage to achieve F1-scores (0.835 and 0.808 for deletions and insertions, respectively) similar to those of PBSV and Sniffles2 using 10× PacBio Hi-Fi long-read coverage.
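The sliding-window, barcode-guided selection described above can be illustrated with a minimal sketch. This is not Blackbird's implementation: the function names, the window size and overlap used in the example, and the `(start, end, barcode)` alignment tuples are all assumptions made for illustration. The sketch only shows how overlapping windows might be laid over a chromosome and how the barcodes of reads aligned in each window could be collected as input for a local assembly step.

```python
from typing import Dict, Iterator, List, Set, Tuple

Window = Tuple[int, int]            # half-open interval [start, end)
Alignment = Tuple[int, int, str]    # (start, end, barcode); hypothetical layout


def sliding_windows(chrom_length: int, window_size: int, overlap: int) -> Iterator[Window]:
    """Yield overlapping half-open windows that cover [0, chrom_length)."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    start = 0
    while start < chrom_length:
        end = min(start + window_size, chrom_length)
        yield (start, end)
        if end == chrom_length:
            break
        start += step


def barcodes_per_window(alignments: List[Alignment],
                        windows: List[Window]) -> Dict[Window, Set[str]]:
    """Collect the barcodes of all reads whose alignment overlaps each window."""
    result: Dict[Window, Set[str]] = {w: set() for w in windows}
    for start, end, barcode in alignments:
        for w_start, w_end in windows:
            # Two half-open intervals overlap when each starts before the other ends.
            if start < w_end and end > w_start:
                result[(w_start, w_end)].add(barcode)
    return result


if __name__ == "__main__":
    # Toy values only; real window sizes would be far larger.
    windows = list(sliding_windows(chrom_length=100, window_size=50, overlap=10))
    reads = [(5, 20, "A"), (45, 60, "B"), (95, 99, "C")]
    for window, barcodes in barcodes_per_window(reads, windows).items():
        print(window, sorted(barcodes))
```

In a real pipeline, the reads sharing the barcodes collected for a window would be passed to a local assembler, and the quadratic scan over windows would be replaced by an indexed lookup.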
Blackbird is available at https://github.com/1dayac/Blackbird.