Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Campus UAB, 08193 Bellaterra, Barcelona, Spain.
Computer Science Department, Technical University of Catalonia, Carrer de Jordi Girona 1-3, 08034, Barcelona, Spain.
BMC Bioinformatics. 2019 Jul 30;20(1):410. doi: 10.1186/s12859-019-2991-2.
Antiretroviral drugs are a very effective therapy against HIV infection. However, the high mutation rate of HIV permits the emergence of variants that are resistant to drug treatment. Predicting the drug resistance of previously unobserved variants is therefore very important for optimal medical treatment. In this paper, we propose the use of weighted categorical kernel functions to predict drug resistance from virus sequence data. These kernel functions are simple to implement, can account for particularities of HIV data such as allele mixtures, and can weight each protein residue by its importance, since it is known that not all positions contribute equally to resistance.
We analyzed 21 drugs from four classes: protease inhibitors (PI), integrase inhibitors (INI), nucleoside reverse transcriptase inhibitors (NRTI) and non-nucleoside reverse transcriptase inhibitors (NNRTI). We compared two categorical kernel functions, Overlap and Jaccard, against two well-known noncategorical kernel functions (Linear and RBF) and Random Forest (RF). Weighted versions of these kernels were also considered, with weights obtained from the RF decrease in node impurity. The Jaccard kernel, in either its weighted or unweighted form, was the best method for 20 of the 21 drugs.
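As an illustration of the two categorical kernels named above, the following is a minimal sketch and not the authors' implementation (see the catkern repository for that). It assumes each sequence position is encoded as a set of amino-acid letters, so that an allele mixture such as "IV" becomes the set {I, V}; the function names, the encoding, and the normalization of weights to sum to 1 are all assumptions made for this example.

```python
from typing import FrozenSet, List, Optional, Sequence

Position = FrozenSet[str]  # alleles observed at one residue (assumed encoding)


def parse(seq: Sequence[str]) -> List[Position]:
    """Turn e.g. ["L", "IV", "M"] into allele sets; "IV" is a mixture."""
    return [frozenset(p) for p in seq]


def _normalized_weights(n: int, weights: Optional[Sequence[float]]) -> List[float]:
    """Uniform weights if none are given; otherwise rescale to sum to 1."""
    if weights is None:
        return [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("one weight per position is required")
    total = float(sum(weights))
    return [w / total for w in weights]


def overlap_kernel(x: List[Position], y: List[Position],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted fraction of positions whose allele sets match exactly."""
    w = _normalized_weights(len(x), weights)
    return sum(wi for wi, a, b in zip(w, x, y) if a == b)


def jaccard_kernel(x: List[Position], y: List[Position],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean of per-position Jaccard indices |A & B| / |A | B|.

    Assumes every position holds at least one allele (no empty sets).
    """
    w = _normalized_weights(len(x), weights)
    return sum(wi * len(a & b) / len(a | b) for wi, a, b in zip(w, x, y))


# Hypothetical three-residue sequences; the second residue of x is a mixture.
x = parse(["L", "IV", "M"])
y = parse(["L", "I", "T"])
k_overlap = overlap_kernel(x, y)
k_jaccard = jaccard_kernel(x, y)
k_jaccard_weighted = jaccard_kernel(x, y, weights=[1.0, 2.0, 1.0])
```

In this sketch the Overlap kernel treats the mixture "IV" and the single allele "I" as a plain mismatch, whereas the Jaccard kernel gives that position partial credit, which is one way to read the claim that accounting for mixtures helps. In practice the weights would come from a fitted Random Forest's variable importances rather than the made-up values used here.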
Results show that kernels that take into account both the categorical nature of the data and the presence of mixtures consistently yield the best prediction model. The advantage of including weights depended on the protein targeted by the drug. For reverse transcriptase, weights based on the relative importance of each position clearly increased prediction performance, while the improvement for protease was much smaller. This seems to be related to the distribution of weights, as measured by the Gini index. All methods described, together with documentation and examples, are freely available at https://bitbucket.org/elies_ramon/catkern.
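To make the reference to the Gini index concrete, here is a minimal sketch of one common way to compute it for a vector of non-negative position weights. The abstract does not state which formula the authors used, so the sorted-rank formulation below, the function name, and the example weight vectors are assumptions for illustration only.

```python
from typing import Sequence


def gini_index(weights: Sequence[float]) -> float:
    """Gini index of non-negative weights: 0 when all weights are equal,
    approaching 1 when a single position carries all the weight."""
    w = sorted(weights)
    n = len(w)
    total = float(sum(w))
    if n == 0 or total == 0.0:
        raise ValueError("weights must be non-empty with a positive sum")
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    ranked = sum(rank * wi for rank, wi in enumerate(w, start=1))
    return 2.0 * ranked / (n * total) - (n + 1.0) / n


# Hypothetical weight profiles, not taken from the paper's data.
g_uniform = gini_index([1.0, 1.0, 1.0, 1.0])       # evenly spread importance
g_concentrated = gini_index([0.0, 0.0, 0.0, 1.0])  # one dominant position
```

Under this reading, a high Gini index means importance is concentrated in a few positions, where weighting has more room to change the kernel than when importance is spread evenly.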