Maebashi Institute of Animal Science, Livestock Improvement Association of Japan, Inc., 316 Kanamaru, Maebashi, Gunma, 371-0121, Japan.
BMC Bioinformatics. 2011 Jun 28;12:263. doi: 10.1186/1471-2105-12-263.
A Bayesian approach based on a Dirichlet process (DP) prior is useful for inferring genetic population structures because it can infer the number of populations and the assignment of individuals simultaneously. However, the properties of the DP prior method are not well understood, and therefore, the use of this method is relatively uncommon. We characterized the DP prior method to increase its practical use.
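As an illustration of why a DP prior does not require the number of populations to be fixed in advance, the sketch below draws a partition of individuals from a Chinese restaurant process, the standard sequential representation of the partition induced by a DP. It is a minimal sketch of the prior alone, not the algorithm used in the program described here: the function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions introduced for illustration, and no genotype data enter the draw.

```python
import random


def sample_crp_partition(n_individuals, alpha, rng):
    """Draw one partition of individuals from a Chinese restaurant process.

    alpha is the (assumed) DP concentration parameter; larger values
    favor more clusters. Names here are illustrative, not from the paper.
    """
    assignments = []    # cluster label of each individual, in order of arrival
    cluster_sizes = []  # current number of individuals in each cluster
    for _ in range(n_individuals):
        # An existing cluster k is chosen with weight equal to its size;
        # a new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # open a new cluster
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_partition(100, alpha=1.0, rng=random.Random(1))
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is an outcome of the draw rather than an input, which is the property that lets a DP-based method infer the number of populations and the assignments together. Because existing clusters attract new members in proportion to their size, draws from this prior typically produce clusters of unequal size.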
First, we evaluated the sequentially-allocated merge-split (SAMS) sampler, a technique for improving the mixing of Markov chain Monte Carlo algorithms. Although this sampler had been implemented in an earlier program, HWLER, its effectiveness had not been investigated. We showed that the sampler was effective for population structure analysis: implementing it was useful with regard to both the accuracy of inference and computational time. Second, we examined the effect of a hyperparameter for the prior distribution of allele frequencies and showed that the specification of this parameter was important and that the problem of specifying it could be resolved by treating the parameter as a variable. Third, we compared the DP prior method with other Bayesian clustering methods and showed that the DP prior method was suitable for data sets with unbalanced sample sizes among populations. In contrast, although currently popular algorithms for population structure analysis, such as those implemented in STRUCTURE, were suitable for data sets with uniform sample sizes, their inferences for unbalanced sample sizes tended to be less accurate than those of the DP prior method.
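To see why the hyperparameter of the allele frequency prior can matter, the following sketch computes the log marginal likelihood of the allele counts at one locus in one cluster after integrating out the allele frequencies. It assumes a symmetric Dirichlet prior with a single parameter, called `lam` here; the abstract does not state the form of the prior, so both the prior and the function name `log_marginal_allele_counts` are assumptions made for illustration.

```python
from math import lgamma


def log_marginal_allele_counts(counts, lam):
    """Log probability of an ordered sequence of allele draws with the given
    per-allele counts, with allele frequencies integrated out under an
    assumed symmetric Dirichlet(lam, ..., lam) prior.
    """
    n_alleles = len(counts)
    total = sum(counts)
    result = lgamma(n_alleles * lam) - lgamma(total + n_alleles * lam)
    for c in counts:
        result += lgamma(c + lam) - lgamma(lam)
    return result


if __name__ == "__main__":
    skewed = [18, 1, 1, 0]    # one allele nearly fixed
    balanced = [5, 5, 5, 5]   # alleles equally common
    for lam in (0.1, 1.0, 10.0):
        print(lam,
              round(log_marginal_allele_counts(skewed, lam), 3),
              round(log_marginal_allele_counts(balanced, lam), 3))
```

Under these assumptions, small values of `lam` favor clusters in which one allele dominates, and large values favor even allele frequencies, so a fixed value chosen badly can shift how strongly the data support a given clustering. Treating the hyperparameter as a variable, as the abstract describes, avoids having to commit to one value in advance.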
The clustering method based on the DP prior was found to be useful because it infers the number of populations and assigns individuals to populations simultaneously, and because it is suitable for data sets with unbalanced sample sizes among populations. Here we present a novel program, DPART, that implements the SAMS sampler and can treat the hyperparameter for the prior distribution of allele frequencies as a variable.