Nowicki Marek, Mroczek Magdalena, Mukhedkar Dhananjay, Bała Piotr, Nikolai Pimenoff Ville, Arroyo Mühr Laila Sara
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland.
Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, ul. Chopina 12/18, PL-87-100 Toruń, Poland.
Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf155.
Human papillomaviruses (HPVs) are among the most diverse viral families that infect humans. Fortunately, only a small number of closely related HPV types affect human health, most notably by causing nearly all cervical cancers, as well as some oral and other anogenital cancers, particularly when infections with high-risk HPV types become persistent. Numerous viral polymerase chain reaction-based diagnostic methods and sequencing protocols have been developed for accurate, rapid, and efficient HPV genotyping. However, because of the large number of closely related HPV genotypes and the abundance of nonviral DNA in human-derived biological samples, correctly detecting HPV genotypes with high-throughput deep sequencing can be challenging. Here, we introduce a novel HPV detection algorithm, HPV-KITE (HPV K-mer Index Tversky Estimator), which applies k-mer analysis and the Tversky index to DNA and RNA sequence data. This method offers a rapid and sensitive alternative for detecting HPV in both metagenomic and transcriptomic datasets. We assessed HPV-KITE using three previously analyzed HPV infection-related datasets comprising a total of 1430 sequenced human samples. For benchmarking, we compared our method's performance with standard HPV sequencing analysis algorithms, including general sequence-based mapping and k-mer-based classification methods. Shingling combined with parallelization gave fast processing times, and scalability analysis showed the best performance when multiple nodes were used. Our results showed that HPV-KITE is one of the fastest, most accurate, and easiest ways to detect HPV genotypes from virtually any next-generation sequencing data. Moreover, the method is highly scalable and can be adapted to microorganisms other than HPV.
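To make the core idea concrete, the following is a minimal sketch of scoring a sample against reference genomes with k-mer shingling and the Tversky index, S(X, Y) = |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|). It is not the HPV-KITE implementation: the function names, the toy sequences, the value of k, and the weights α = 1, β = 0 are all assumptions chosen for illustration, since the abstract does not state them. With these weights the score is the fraction of reference k-mers found in the sample, so k-mers from nonviral background DNA are not penalized.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of overlapping k-mers (shingles) of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tversky(x: Set[str], y: Set[str], alpha: float, beta: float) -> float:
    """Tversky index: |X&Y| / (|X&Y| + alpha*|X-Y| + beta*|Y-X|)."""
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    return common / denom if denom else 0.0


def rank_references(sample: str, refs: Dict[str, str], k: int = 4,
                    alpha: float = 1.0, beta: float = 0.0
                    ) -> List[Tuple[str, float]]:
    """Score each reference against the sample, highest score first.

    X is the reference k-mer set and Y is the sample k-mer set.
    The defaults (alpha=1, beta=0) are an illustrative assumption.
    """
    sample_kmers = kmers(sample, k)
    scores = [(name, tversky(kmers(seq, k), sample_kmers, alpha, beta))
              for name, seq in refs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical sequences, not real HPV genomes).
refs = {"typeA": "ACGTACGGTCA", "typeB": "TTGACCATGCA"}
sample = "GGG" + "ACGTACGGTCA" + "CCCTTT"  # typeA embedded in background
ranking = rank_references(sample, refs)
print(ranking)
```

Setting α = β = 1 reduces the index to the Jaccard similarity and α = β = 0.5 to the Dice coefficient, so the asymmetric weights are the only thing that distinguishes this score from those symmetric measures. Real genotyping would use a much larger k and handle reverse complements, which this sketch omits.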