Yin Junming, Beerenwinkel Niko, Rahnenführer Jörg, Lengauer Thomas
Department of EECS, University of California, Berkeley, CA, USA.
Stat Appl Genet Mol Biol. 2006;5:Article17. doi: 10.2202/1544-6115.1164. Epub 2006 Jun 23.
The evolution of drug resistance in HIV is characterized by the accumulation of resistance-associated mutations in the HIV genome. Mutagenetic trees, a family of restricted Bayesian tree models, have been applied to infer the order and rate of occurrence of these mutations. Understanding and predicting this evolutionary process is an important prerequisite for the rational design of antiretroviral therapies. In practice, mixture models of K mutagenetic trees provide more flexibility and are often more appropriate for modelling observed mutational patterns. Here, we investigate the model selection problem for mixture models of K mutagenetic trees. We evaluate several classical model selection criteria, including cross-validation, the Bayesian Information Criterion (BIC), and the Akaike Information Criterion (AIC). We also apply an empirical Bayes method, constructing a prior probability distribution for the parameters of a mutagenetic trees mixture model and deriving the posterior probability of the model. In addition to the model dimension, we consider the redundancy of a mixture model, which is measured by comparing the topologies of the trees within the mixture. Based on this redundancy, we propose a new model selection criterion, which is a modification of the BIC. Experimental results on simulated and on real HIV data show that the classical criteria tend to select models with far too many tree components. Only cross-validation and the modified BIC recover the correct number of trees and the tree topologies most of the time. While matching the performance of cross-validation, the modified BIC has a runtime about one order of magnitude lower. Thus, this model selection criterion can also be used for large data sets for which cross-validation becomes computationally infeasible.
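As a rough illustration of the classical criteria named above, the following Python sketch computes the standard textbook AIC and BIC from a fitted model's log-likelihood and picks the number of components K with the lowest score. It is not the authors' implementation: the function names, the parameter counts, and the log-likelihood values are all made up for illustration, and the redundancy-based BIC modification is not shown because the abstract does not give its formula.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Standard Akaike Information Criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Standard Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

def select_k(fits: dict, n_samples: int, criterion: str = "bic") -> int:
    """Return the K whose fitted model minimizes the chosen criterion.

    `fits` maps K -> (log_likelihood, n_params); this layout is a
    hypothetical convention chosen for this sketch.
    """
    def score(item):
        _, (loglik, n_params) = item
        if criterion == "aic":
            return aic(loglik, n_params)
        return bic(loglik, n_params, n_samples)
    return min(fits.items(), key=score)[0]

# Hypothetical values, NOT results from the paper:
# K -> (maximized log-likelihood, number of free parameters)
example_fits = {1: (-520.0, 6), 2: (-480.0, 13), 3: (-472.0, 20)}

print(select_k(example_fits, n_samples=200, criterion="aic"))
print(select_k(example_fits, n_samples=200, criterion="bic"))
```

With these invented numbers, the AIC prefers the larger model while the BIC, whose penalty grows with the sample size, prefers the smaller one. This only illustrates how the penalties differ; per the abstract, on the actual data both classical criteria tended to select too many tree components.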