Department of Computer Science, Georgia State University, Atlanta, Georgia, USA.
Department of Mathematics and Statistics, Georgia State University, Atlanta, Georgia, USA.
J Comput Biol. 2023 Sep;30(9):1009-1018. doi: 10.1089/cmb.2023.0154. Epub 2023 Sep 11.
Identifying viral variants through clustering is essential for understanding the composition and structure of viral populations within and between hosts, which play a crucial role in disease progression and epidemic spread. This article proposes and validates novel Monte Carlo (MC) methods for clustering aligned viral sequences by minimizing either the entropy of the clusters or the Hamming distance of sequences from their cluster consensuses. We validate these methods on four benchmarks: two SARS-CoV-2 interhost data sets and two HIV intrahost data sets. A parallelized version of our tool scales to very large data sets. We show that both entropy-based and Hamming distance-based MC clusterings discern meaningful information from sequencing data. The proposed clustering methods consistently converge to similar clusterings across different runs. Finally, we show that MC clustering improves reconstruction of intrahost viral populations from sequencing data.
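To make the Hamming distance objective concrete, the following is a minimal illustrative sketch, not the authors' tool. It assumes a simple greedy Monte Carlo scheme: a randomly chosen sequence is moved to a randomly chosen cluster, and the move is kept only if the total Hamming distance from cluster consensuses does not increase. The function names (`consensus`, `hamming`, `cost`, `mc_cluster`), the acceptance rule, the iteration count, and the toy sequences are all assumptions made for illustration; the abstract does not specify them. An entropy-based objective could be substituted for `cost` in the same loop.

```python
import random
from collections import Counter


def consensus(seqs):
    """Column-wise most frequent character of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cost(seqs, labels, k):
    """Total Hamming distance from each sequence to its cluster consensus."""
    total = 0
    for c in range(k):
        members = [s for s, lab in zip(seqs, labels) if lab == c]
        if members:  # empty clusters contribute nothing
            cons = consensus(members)
            total += sum(hamming(s, cons) for s in members)
    return total


def mc_cluster(seqs, k, iters=2000, seed=0):
    """Greedy Monte Carlo search (an assumed scheme, not the published one).

    Starts from a random assignment and repeatedly proposes moving one
    random sequence to a random cluster, keeping the move only when the
    total cost does not increase.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in seqs]
    best = cost(seqs, labels, k)
    for _ in range(iters):
        i = rng.randrange(len(seqs))
        old, new = labels[i], rng.randrange(k)
        if new == old:
            continue
        labels[i] = new
        candidate = cost(seqs, labels, k)
        if candidate <= best:
            best = candidate  # accept the move
        else:
            labels[i] = old  # reject and restore
    return labels, best


if __name__ == "__main__":
    # Toy aligned sequences (hypothetical data, two obvious variants).
    toy = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCC", "CCCG"]
    labels, total = mc_cluster(toy, k=2)
    print(labels, total)
```

Recomputing the full cost after every proposed move keeps the sketch short but is quadratic in practice; a scalable implementation would presumably update per-cluster column counts incrementally, which is a design choice outside what the abstract describes.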