Zhang Shunpu, Li Zhong, Beland Kevin, Lu Guoqing
Department of Statistics, University of Central Florida, Orlando, FL, 32816, USA.
College of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, China.
BMC Bioinformatics. 2016 Jul 21;17:287. doi: 10.1186/s12859-016-1147-x.
Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. Open issues remain, notably how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of clustering results.
We presented a model-based clustering method for analyzing molecular sequences, described a subset bootstrap scheme for evaluating the certainty of the clusters, and showed an intuitive way to examine clusters using 3D visualization. We applied this approach to influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence is correctly assigned to its cluster was 1.0 for every sequence, and the certainty for a given cluster was also very high (0.92-1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated, and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method were all higher than those calculated using the standard bootstrap method, suggesting that our bootstrap scheme is applicable to the estimation of clustering certainty.
We formulated a clustering analysis approach that includes estimation of certainties and 3D visualization of sequence data. We analyzed two sets of influenza A HA sequences, and the results indicate that our approach is applicable to clustering analysis of influenza viral sequences.