Luo Can, Peters Brock A, Zhou Xin Maizie
Department of Biomedical Engineering, Vanderbilt University, Nashville, 37235, TN, USA.
Advanced Genomics Technology Lab, Complete Genomics Inc, 2904 Orchard Parkway, San Jose, 95134, CA, USA.
BMC Genomics. 2025 Mar 18;26(Suppl 2):263. doi: 10.1186/s12864-025-11398-z.
Linked-reads improve de novo assembly, haplotype phasing, structural variant (SV) detection, and other applications through highly multiplexed genome partitioning and barcoding. However, whole genome assembly and assembly-based variant detection from linked-reads are often computationally intensive, which makes them unsuitable for large population studies. Here we propose RegionIndel, an efficient region-based diploid assembly pipeline for characterizing large indel SVs. The pipeline focuses only on target regions (50 kb by default): it extracts the barcoded reads from each region as input, then integrates a haplotyping algorithm with local assembly to generate phased diploid contiguous sequences (contigs). Finally, it detects variants in the contigs through a pairwise contig-to-reference comparison.
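The final step, calling indels from a contig-to-reference comparison, can be illustrated with a minimal sketch. It assumes the contig has already been aligned to the reference and that the alignment is summarized as a standard SAM-style CIGAR string; the function name `large_indels` and the 50 bp size cutoff are illustrative assumptions, not details taken from RegionIndel itself.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def large_indels(ref_start: int, cigar: str,
                 min_size: int = 50) -> List[Tuple[str, int, int]]:
    """Return (type, reference position, length) for each indel >= min_size.

    ref_start is the reference coordinate where the contig alignment begins.
    min_size=50 is an assumed cutoff for "large" indels (hypothetical default).
    """
    calls = []
    ref_pos = ref_start
    for length_str, op in CIGAR_OP.findall(cigar):
        length = int(length_str)
        if op == "D" and length >= min_size:
            # Bases present in the reference but missing from the contig.
            calls.append(("DEL", ref_pos, length))
        elif op == "I" and length >= min_size:
            # Bases present in the contig but absent from the reference.
            calls.append(("INS", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return calls

# A contig aligned at position 1000 with one 60 bp deletion,
# one small 5 bp insertion (ignored), and one 75 bp insertion.
calls = large_indels(1000, "200M60D100M5I50M75I20M")
print(calls)
```

Because the contigs are phased, running such a comparison once per haplotype contig would give a per-haplotype call set from which a genotype could be derived; that step is not shown here.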
We applied RegionIndel to two linked-reads libraries of sample HG002, one prepared with 10x and the other with stLFR. HG002 is a well-studied sample for which the Genome in a Bottle (GiaB) community provides a gold standard SV set. RegionIndel outperformed several assembly-based and alignment-based SV callers in our benchmark experiments. After assembling all indel SVs, RegionIndel achieved an overall F1 score of 74.8% for deletions and 61.8% for insertions with 10x linked-reads, and 64.3% for deletions and 36.7% for insertions with stLFR linked-reads. Furthermore, it achieved an overall genotyping accuracy of 83.6% and 80.8% for 10x and stLFR linked-reads, respectively.
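For readers less familiar with the reported metrics, the sketch below shows how an F1 score and a genotyping accuracy are conventionally computed from benchmark counts. The counts in the usage lines are made up for illustration and are not the paper's data, and defining genotyping accuracy as the fraction of true-positive calls whose genotype matches the truth set is an assumption about the benchmark.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def genotyping_accuracy(matching_genotypes: int, true_positive_calls: int) -> float:
    """Fraction of true-positive calls whose genotype agrees with the truth set (assumed definition)."""
    if true_positive_calls == 0:
        return 0.0
    return matching_genotypes / true_positive_calls

# Hypothetical counts, chosen only to show the arithmetic.
print(f1_score(tp=80, fp=20, fn=20))
print(genotyping_accuracy(matching_genotypes=9, true_positive_calls=10))
```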
RegionIndel can achieve diploid assembly and detect indel SVs in each target region. The phased diploid contigs further allow us to investigate indel SVs together with nearby linked single nucleotide polymorphisms (SNPs) and small indels on the same haplotype.