Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), Singapore.
NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore (NUS), Singapore.
Mol Biol Evol. 2019 Jul 1;36(7):1580-1595. doi: 10.1093/molbev/msz053.
Subspecies nomenclature systems of pathogens are increasingly based on sequence data. The use of phylogenetics to identify and differentiate between clusters of genetically similar pathogens is particularly prevalent in virology, from the nomenclature of human papillomaviruses to that of highly pathogenic avian influenza (HPAI) H5Nx viruses. These nomenclature systems rely on absolute genetic distance thresholds to define the maximum genetic divergence tolerated between viruses designated as closely related. However, the phylogenetic clustering methods used in these nomenclature systems are limited by the arbitrariness of setting intra- and intercluster diversity thresholds. The lack of a consensus ground truth to define well-delineated, meaningful phylogenetic subpopulations amplifies the difficulties in identifying an informative distance threshold. Consequently, phylogenetic clustering often becomes an exploratory, ad hoc exercise. Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) was developed to provide a statistically principled phylogenetic clustering framework that removes the need for an arbitrarily defined distance threshold. Using the pairwise patristic distance distributions of an input phylogeny, PhyCLIP parameterizes the intra- and intercluster divergence limits as statistical bounds in an integer linear programming model, which is subsequently optimized to cluster as many sequences as possible. When applied to the hemagglutinin phylogeny of HPAI H5Nx viruses, PhyCLIP not only recapitulated the current WHO/OIE/FAO H5 nomenclature system but also delineated informative higher-resolution clusters that capture geographically distinct subpopulations of viruses. PhyCLIP is pathogen-agnostic and can be generalized to a wide variety of research questions concerning the identification of biologically informative clusters in pathogen phylogenies.
PhyCLIP is freely available at http://github.com/alvinxhan/PhyCLIP, last accessed March 15, 2019.
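To make the idea of a distribution-derived divergence limit concrete, the following is a minimal Python sketch and not PhyCLIP's actual code or interface. It computes pairwise patristic distances (summed branch lengths between two tips) on a hypothetical four-tip tree, then derives a within-cluster limit from the distribution of each subtree's mean pairwise distance. The toy tree, the function names, the choice of median plus a multiple of the median absolute deviation as the bound, and the multiplier `gamma` are all assumptions made for illustration; the abstract does not specify the form of the bound. The integer linear programming step that selects the final clusters is omitted here.

```python
from itertools import combinations
from statistics import median

# Hypothetical toy tree (not from the paper): child -> (parent, branch length).
TREE = {
    "A": ("n1", 0.01),
    "B": ("n1", 0.02),
    "C": ("n2", 0.01),
    "D": ("n2", 0.03),
    "n1": ("root", 0.05),
    "n2": ("root", 0.20),
}
TIPS = ["A", "B", "C", "D"]
INTERNAL = ["n1", "n2", "root"]


def path_to_root(node):
    """Map each ancestor of `node` (and the node itself) to its distance from `node`."""
    dists = {node: 0.0}
    total = 0.0
    while node in TREE:
        parent, length = TREE[node]
        total += length
        dists[parent] = total
        node = parent
    return dists


def patristic(a, b):
    """Patristic distance: sum of branch lengths on the tree path between a and b."""
    da, db = path_to_root(a), path_to_root(b)
    # Among shared ancestors, the most recent one gives the shortest combined path.
    return min(da[n] + db[n] for n in da if n in db)


def mean_pairwise(node):
    """Mean pairwise patristic distance among the tips descending from `node`."""
    tips = [t for t in TIPS if node in path_to_root(t)]
    pairs = list(combinations(tips, 2))
    return sum(patristic(a, b) for a, b in pairs) / len(pairs)


def distance_limit(values, gamma):
    """Assumed bound for illustration: median + gamma * median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med + gamma * mad


subtree_means = {node: mean_pairwise(node) for node in INTERNAL}
limit = distance_limit(list(subtree_means.values()), gamma=3.0)

# Subtrees whose internal divergence falls within the limit are cluster candidates.
candidates = [node for node in INTERNAL if subtree_means[node] <= limit]
print(candidates)
```

In this toy case the two shallow subtrees fall within the limit while the root, which spans both lineages, does not. In a full pipeline of this kind, an optimization step would then choose among such candidate subtrees so that as many sequences as possible are clustered.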