Department of Computer Science, The University of Hong Kong, Pok Fu Lam Road, Hong Kong, 999077, China.
Faculty of Computing, Harbin Institute of Technology, 92 Xidazhi Street, Nangang District, Harbin, Heilongjiang 150001, China.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae613.
Ensuring that variant representations are unified and consistent with the sequencing alignments is critical for downstream analysis, because variant representation may differ across platforms and sequencing conditions. Current approaches typically treat variant unification as a post-processing step after variant calling and cannot determine the correct variant representation from the outset. Unifying variant representations with the alignments before variant calling has benefits, such as providing reliable training labels for deep learning-based variant callers and enabling direct assessment of alignment quality. However, it also poses challenges because of the large number of candidates to handle. Here, we present Repun, a haplotype-aware variant-alignment unification algorithm that harmonizes the variant representation between provided variants and alignments across different sequencing platforms. Repun leverages phasing to facilitate equivalent haplotype matches between variants and alignments. Our approach reduces the number of comparisons between variant haplotypes and candidate haplotypes by using haplotypes with read evidence, which speeds up the unification process. In extensive evaluations on various Genome in a Bottle Consortium samples spanning three sequencing platforms (Oxford Nanopore Technology, Pacific Biosciences, and Illumina), Repun achieved >99.99% precision and >99.5% recall. Repun is open source and available at https://github.com/zhengzhenxian/Repun.
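To illustrate why the same variant can have several valid representations, and what "equivalent haplotype matches" might look like in the simplest case, the following is a minimal Python sketch. It is not Repun's implementation: the `(position, ref, alt)` tuple format, the function names `apply_variants`, `equivalent`, and `unify`, and the toy reference sequences are all assumptions made for illustration. It treats two variant sets as equivalent when applying them to the reference yields the same haplotype sequence, and it searches only among candidate sets assumed to have read evidence.

```python
from typing import List, Optional, Tuple

# Hypothetical toy format: (0-based position, reference allele, alternate allele).
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Build the haplotype sequence by applying non-overlapping variants."""
    pieces = []
    cursor = 0  # first reference base not yet copied to the output
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged bases before the variant
        pieces.append(alt)                    # substitute the alternate allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


def equivalent(reference: str, a: List[Variant], b: List[Variant]) -> bool:
    """Two variant sets are equivalent if they produce the same haplotype."""
    return apply_variants(reference, a) == apply_variants(reference, b)


def unify(
    reference: str,
    truth: List[Variant],
    read_supported_candidates: List[List[Variant]],
) -> Optional[List[Variant]]:
    """Return the first read-supported candidate set whose haplotype matches
    the haplotype of the provided (truth) variants, or None if none matches.

    Restricting the search to candidates with read evidence is the assumption
    here that keeps the number of haplotype comparisons small.
    """
    target = apply_variants(reference, truth)
    for candidate in read_supported_candidates:
        if apply_variants(reference, candidate) == target:
            return candidate
    return None


# A CA deletion inside a CA repeat can be written at different positions.
REF = "GCACACAT"
left_aligned = [(0, "GCA", "G")]
right_shifted = [(4, "ACA", "A")]
unrelated_snv = [(2, "A", "T")]

# A multi-nucleotide variant can also be written as two adjacent SNVs.
REF2 = "ACGT"
mnv = [(1, "CG", "TA")]
two_snvs = [(1, "C", "T"), (2, "G", "A")]
```

In this toy setting, `unify(REF, left_aligned, [unrelated_snv, right_shifted])` selects `right_shifted`, the representation that an aligner might have produced for the same underlying deletion. A realistic tool would additionally have to handle phasing, both haplotypes of a diploid sample, and overlapping variants, none of which this sketch attempts.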