Sing Tobias, Low Andrew J, Beerenwinkel Niko, Sander Oliver, Cheung Peter K, Domingues Francisco S, Büch Joachim, Däumer Martin, Kaiser Rolf, Lengauer Thomas, Harrigan P Richard
Max Planck Institute for Informatics, Saarbrücken, Germany.
Antivir Ther. 2007;12(7):1097-106.
We compared several statistical learning methods for the prediction of HIV coreceptor use from clonal HIV third hypervariable (V3) loop sequences, and evaluated and improved their effectiveness on clinical samples.
Support vector machines (SVMs), artificial neural networks, position-specific scoring matrices (PSSMs) and mixtures of localized rules were estimated and tested using ten repetitions of ten-fold cross-validation on a clonal dataset of 1,100 matched genotype-phenotype pairs from 332 patients. Different SVMs were also trained and tested on a clinically derived dataset representing 920 patient samples from British Columbia, Canada. Methods were evaluated using receiver operating characteristic (ROC) curves.
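Sensitivity at a fixed specificity corresponds to a single operating point on a ROC curve. The following is a minimal sketch of how such a point could be read from classifier scores; the function name, the label encoding (1 for CXCR4-using, 0 for CCR5-only) and the toy scores are assumptions made for illustration and are not taken from the study.

```python
def sensitivity_at_specificity(scores, labels, target_specificity):
    """Return (sensitivity, specificity, threshold) at the most sensitive
    operating point whose specificity is at least target_specificity.

    Assumed convention (not from the paper): label 1 = positive class,
    label 0 = negative class, and a sample is called positive when its
    score is >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    # Raising the threshold can only raise specificity and lower sensitivity,
    # so the first threshold that meets the target is the most sensitive one.
    for threshold in sorted(set(scores)):
        true_pos = sum(s >= threshold for s in positives)
        true_neg = sum(s < threshold for s in negatives)
        specificity = true_neg / len(negatives)
        if specificity >= target_specificity:
            return true_pos / len(positives), specificity, threshold

    # No observed score meets the target: call nothing positive.
    return 0.0, 1.0, float("inf")


# Toy scores, purely illustrative.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
sens, spec, thr = sensitivity_at_specificity(toy_scores, toy_labels, 0.8)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} threshold={thr}")
```

Comparing methods at a matched specificity in this way keeps the false-positive rate constant, so differences in sensitivity can be attributed to the method rather than to the choice of cutoff.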
In the clonal analysis, the sensitivity of the 11/25 rule at 92.5% specificity was 59.5%. PSSMs and SVMs increased sensitivity to 71.9% and 76.4%, respectively, at the same specificity (P << 0.05). In clinical samples, the sensitivity of the 11/25 rule and SVM decreased to 25.9% (specificity 93.9%) and 39.8% (specificity 93.5%), respectively. However, the integration of clinical data raised sensitivity to 63%, a 2.4-fold increase over the 11/25 rule. Univariate analyses identified 41 V3 mutations significantly associated with coreceptor usage.
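The abstract does not spell out the 11/25 rule. In its commonly stated form, it predicts CXCR4 use when V3 position 11 or 25 carries a positively charged residue (arginine or lysine). The sketch below encodes that form under the assumption of an ungapped 35-residue V3 loop; the function name and the example sequence are illustrative and are not taken from the study.

```python
BASIC_RESIDUES = {"R", "K"}  # arginine and lysine, the positively charged residues


def predicts_cxcr4_by_11_25_rule(v3_sequence):
    """Return True if the 11/25 rule, as commonly stated, predicts CXCR4 use.

    Assumption: v3_sequence is an ungapped 35-residue V3 loop, so positions
    11 and 25 (1-based) map to string indices 10 and 24. Sequences with
    insertions or deletions would need alignment first.
    """
    seq = v3_sequence.upper()
    if len(seq) != 35:
        raise ValueError("expected an ungapped 35-residue V3 sequence")
    return seq[10] in BASIC_RESIDUES or seq[24] in BASIC_RESIDUES


# Illustrative V3-like sequence with serine at position 11 and glutamate at 25.
example_v3 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
print(predicts_cxcr4_by_11_25_rule(example_v3))

# The same sequence with arginine substituted at position 11.
mutant_v3 = example_v3[:10] + "R" + example_v3[11:]
print(predicts_cxcr4_by_11_25_rule(mutant_v3))
```

Because the rule inspects only two positions, it has no adjustable threshold and yields a single operating point rather than a full ROC curve.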
For all methods tested, a substantial sensitivity decrease is observed on clinical data, probably owing to the heterogeneity of the viral population in vivo. In response to these complications, we present an SVM-based approach that integrates sequence information with clinical and host data, resulting in improved performance and sensitivity compared with purely sequence-based approaches.