DiRienzo A Gregory, DeGruttola Victor, Larder Brendan, Hertogs Kurt
Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA.
Stat Med. 2003 Sep 15;22(17):2785-98. doi: 10.1002/sim.1516.
Medical management of HIV infection requires an understanding of the relationship between viral genetic sequences and viral susceptibility to antiretroviral drugs. Because of the high dimensionality of the data on viral genotype, traditional statistical methods are not well suited for investigating this relationship. We develop non-parametric methods specifically for the setting where high-dimensional data provide a basis for predicting a low-dimensional response variable. Our non-recursive methods proceed in three stages: (i) build models, in a forward-stepwise manner, that predict phenotype response from genotype sequence; (ii) identify specific patterns of amino acid sequence that are most influential in predicting phenotype; and (iii) identify combinations of codons that have either a concordant or a discordant association in the occurrence of a mutation. The methods are applied to a data set provided by the Virco Group that contains protease genome sequences and IC50 measurements on a drug from the protease inhibitor class, amprenavir, for 2747 patient samples. Using these methods, we identified eight codons from the protease region of the HIV genome that predict resistance to amprenavir, and determined pairs of codons whose mutations tend either to occur together or to preclude one another.
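The abstract does not give the estimators used in each stage, so the following is only a minimal illustrative sketch of the general ideas behind stages (i) and (iii), not the authors' method. It assumes each sample is reduced to binary mutation indicators per codon, predicts the response by the mean within each mutation pattern, and measures pairwise association with the phi coefficient. The function names, the codon labels `A` to `E`, and the response values are all hypothetical toy inputs, not data from the paper.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def rss_by_pattern(X, y, codons):
    """Residual sum of squares when each sample is predicted by the mean
    response of all samples sharing its mutation pattern at `codons`."""
    groups = defaultdict(list)
    for row, resp in zip(X, y):
        groups[tuple(row[c] for c in codons)].append(resp)
    rss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rss += sum((v - mean) ** 2 for v in values)
    return rss


def forward_stepwise(X, y, max_codons, min_gain=1e-9):
    """Greedily add the codon that most reduces the RSS; stop when no
    remaining candidate improves it by more than `min_gain`."""
    selected = []
    current = rss_by_pattern(X, y, selected)
    candidates = set(X[0])
    while candidates and len(selected) < max_codons:
        scores = {c: rss_by_pattern(X, y, selected + [c]) for c in candidates}
        best = min(sorted(scores), key=scores.get)  # ties go to the first label
        if current - scores[best] <= min_gain:
            break
        selected.append(best)
        candidates.remove(best)
        current = scores[best]
    return selected


def phi(X, a, b):
    """Phi coefficient of the 2x2 table of mutations at codons a and b.
    Positive values suggest concordance, negative values discordance."""
    n11 = sum(1 for r in X if r[a] and r[b])
    n10 = sum(1 for r in X if r[a] and not r[b])
    n01 = sum(1 for r in X if not r[a] and r[b])
    n00 = sum(1 for r in X if not r[a] and not r[b])
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Hypothetical toy data: 8 samples, 5 codons, 1 = mutation present.
columns = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 1, 1, 0, 0],
    "C": [1, 0, 1, 0, 1, 0, 1, 0],
    "D": [0, 0, 0, 1, 1, 1, 1, 1],
    "E": [1, 1, 1, 0, 0, 0, 0, 0],
}
X = [dict(zip(columns, vals)) for vals in zip(*columns.values())]
y = [3.0, 3.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]  # made-up response values

print("selected codons:", forward_stepwise(X, y, max_codons=3))
for a, b in combinations(sorted(columns), 2):
    print(a, b, round(phi(X, a, b), 3))
```

In this toy setup the response depends only on codons `A` and `B`, so the greedy search should stop after selecting those two, while the phi coefficients flag `A` and `D` as discordant and `A` and `E` as concordant. With thousands of samples and many codons, grouping by full mutation pattern would fragment the data quickly, which is one reason a real analysis would need a more careful estimator than this sketch.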