Fuhrmann Lara, Langer Benjamin, Topolsky Ivan, Beerenwinkel Niko
Department of Biosystems Science and Engineering, ETH Zurich, Klingelbergstrasse 48, Basel 4056, Switzerland.
SIB Swiss Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, Lausanne 1015, Switzerland.
NAR Genom Bioinform. 2024 Nov 28;6(4):lqae152. doi: 10.1093/nargab/lqae152. eCollection 2024 Dec.
RNA viruses exist as large heterogeneous populations within their host. The structure and diversity of virus populations affect disease progression and treatment outcomes. Next-generation sequencing allows detailed viral population analysis, but inferring diversity from error-prone reads is challenging. Here, we present VILOCA (VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data), a method for mutation calling and reconstruction of local haplotypes from short- and long-read viral sequencing data. Local haplotypes are genomic regions of approximately the same length as the input reads. VILOCA recovers local haplotypes by using a Dirichlet process mixture model to cluster reads around their unobserved haplotypes while leveraging the quality scores of the sequencing reads. We assessed the performance of VILOCA in terms of mutation calling and haplotype reconstruction accuracy on simulated and experimental Illumina, PacBio and Oxford Nanopore data. On simulated and experimental Illumina data, VILOCA performed better than or similarly to existing methods. On the simulated long-read data, VILOCA recovered on average 92% of the ground truth mutations with perfect precision, compared to only 70% recall and 73% precision for the second-best method. In summary, VILOCA provides significantly improved accuracy in mutation and haplotype calling, especially for long-read sequencing data, and therefore facilitates the comprehensive characterization of heterogeneous within-host viral populations.
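To illustrate how per-base quality scores can inform the assignment of a read to a haplotype, the following is a minimal Python sketch. It is not VILOCA's model or code: the candidate haplotypes are assumed to be known in advance, whereas a Dirichlet process mixture model would also infer them and their number. The function names and the uniform substitution error model (an erroneous base is equally likely to be any of the three other nucleotides) are assumptions made for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_log_likelihood(read, quals, haplotype):
    """Log-likelihood of a read given a candidate haplotype of equal length.

    Assumed error model (illustration only): a base is correct with
    probability 1 - e and is one of the three other bases with
    probability e / 3 each, where e comes from the Phred score.
    """
    total = 0.0
    for base, q, hap_base in zip(read, quals, haplotype):
        # Clamp so that very low qualities never give log(0).
        err = min(phred_to_error_prob(q), 0.75)
        if base == hap_base:
            total += math.log(1.0 - err)
        else:
            total += math.log(err / 3.0)
    return total


def assign_read(read, quals, haplotypes):
    """Return the index of the haplotype under which the read is most likely."""
    scores = [read_log_likelihood(read, quals, h) for h in haplotypes]
    return max(range(len(haplotypes)), key=scores.__getitem__)


# Hypothetical toy data: the read differs from each haplotype at one position.
haplotypes = ["ACGT", "TCGA"]
read = "TCGT"

# Low quality at position 1: the mismatch with "ACGT" is cheap.
print(assign_read(read, [2, 30, 30, 30], haplotypes))
# Low quality at position 4: the mismatch with "TCGA" is cheap.
print(assign_read(read, [30, 30, 30, 2], haplotypes))
```

In this toy setting the same read is assigned to different haplotypes depending on which of its bases has low quality, which is the intuition behind weighting mismatches by quality rather than counting them equally.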