Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK.
Department of Microbiology, New York University School of Medicine, NY 10016, USA.
Nucleic Acids Res. 2019 Jun 20;47(11):5539-5549. doi: 10.1093/nar/gkz361.
We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach can cluster datasets 10-100 times larger than those handled by existing model-based methods, which we demonstrate by analyzing an alignment of over 110 000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy to maximize the DPM marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides solutions comparable to or better than those of previous model-based methods, while being significantly faster. The method is freely available under an open-source MIT licence as an easy-to-use R package at https://github.com/gtonkinhill/fastbaps.