Rabinowitz Matthew, Myers Lance, Banjevic Milena, Chan Albert, Sweetkind-Singer Joshua, Haberer Jessica, McCann Kelly, Wolkowicz Roland
Gene Security Network, Palo Alto, CA, USA.
Bioinformatics. 2006 Mar 1;22(5):541-9. doi: 10.1093/bioinformatics/btk011. Epub 2005 Dec 20.
Genotype-phenotype modeling problems are often overcomplete, or ill-posed, because the number of potential predictors (genes, proteins, mutations and their interactions) is large relative to the number of measured outcomes. Such datasets can still be used to train sparse parameter models that generalize accurately, by applying a principle similar to Occam's Razor: when many possible theories can explain the observations, the simplest is most likely to be correct. We apply this philosophy to modeling the drug response of Human Immunodeficiency Virus type 1 (HIV-1). Because genetic sequencing is becoming less expensive relative to in vitro phenotype testing, a statistical model that reliably predicts viral drug response from genetic data is an important tool in the selection of antiretroviral therapy (ART). The optimization techniques described apply to many genotype-phenotype modeling problems in which the goal is to improve clinical decisions.
We describe two regression techniques for predicting viral phenotype in response to ART from genetic sequence data. Both techniques employ convex optimization for the continuous subset selection of a sparse set of model parameters. The first technique, the least absolute shrinkage and selection operator (LASSO), uses an l1-norm penalty on the coefficients to create a sparse linear model; the second, the support vector machine with radial basis kernel functions, uses the epsilon-insensitive loss function to create a sparse non-linear model. The techniques are applied to predict the response of HIV-1 to 10 reverse transcriptase inhibitor drugs and 7 protease inhibitor drugs. The genetic data are derived from the HIV coding sequences for the reverse transcriptase and protease enzymes. When tested by cross-validation against actual laboratory measurements, these models predict drug response phenotype more accurately than models previously discussed in the literature and than the other canonical techniques described here. Two key features of the methods enable this performance: the tendency to generate simple models in which many of the parameters are zero, and the convexity of the cost function, which ensures that model parameters can be found that globally minimize the cost function for a particular training dataset.
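The sparsity mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain cyclic coordinate-descent solver for the LASSO objective (1/(2n))·||y − Xw||² + λ·||w||₁, synthetic binary "mutation" indicators in place of real sequence data, and hypothetical names (`lasso_coordinate_descent`, `epsilon_insensitive_loss`, `true_w`) chosen only for illustration. The number of predictors (60) deliberately exceeds the number of samples (30) to mimic the overcomplete setting.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; values within [-threshold, threshold] become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def epsilon_insensitive_loss(residual, epsilon):
    """Loss used in support vector regression: errors smaller than epsilon cost nothing."""
    return max(0.0, abs(residual) - epsilon)


def lasso_objective(X, y, w, lam):
    """(1/(2n)) * sum of squared residuals + lam * l1 norm of w."""
    n = len(X)
    sq = sum((y[i] - sum(x * wj for x, wj in zip(X[i], w))) ** 2 for i in range(n))
    return sq / (2 * n) + lam * sum(abs(wj) for wj in w)


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize the LASSO objective by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature never observed; its weight stays at zero
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Synthetic data (an assumption for illustration): 30 samples, 60 binary features,
# only three of which actually influence the simulated phenotype.
random.seed(0)
n, p = 30, 60
true_w = {3: 2.0, 17: -1.5, 42: 1.0}
X = [[float(random.random() < 0.3) for _ in range(p)] for _ in range(n)]
raw_y = [sum(wj * row[j] for j, wj in true_w.items()) + random.gauss(0.0, 0.05)
         for row in X]
mean_y = sum(raw_y) / n
y = [v - mean_y for v in raw_y]  # center the response since no intercept is fitted

w = lasso_coordinate_descent(X, y, lam=0.05)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("non-zero coefficients:", len(selected), "of", p)
print("selected feature indices:", selected)
```

Increasing `lam` drives more coefficients exactly to zero, which is the continuous subset selection the abstract refers to; because the objective is convex, coordinate descent does not get trapped in a non-global minimum. In practice the penalty weight would be chosen by cross-validation, as the abstract's evaluation suggests.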
Results, tables and figures are available at ftp://ftp.genesecurity.net.
An Appendix to accompany this article is available at Bioinformatics online.