Herrera-Ibatá Diana María, Pazos Alejandro, Orbegozo-Medina Ricardo Alfredo, Romero-Durán Francisco Javier, González-Díaz Humberto
Department of Information and Communication Technologies, University of A Coruña (UDC), 15071 A Coruña, Spain.
Biosystems. 2015 Jun;132-133:20-34. doi: 10.1016/j.biosystems.2015.04.007. Epub 2015 Apr 24.
Using computational algorithms to design tailored drug cocktails for highly active antiretroviral therapy (HAART) in specific populations is a goal of major importance for both the pharmaceutical industry and public health policy institutions. Designing HAART cocktails requires predicting new combinations of compounds. On the one hand, there are biomolecular factors related to the drugs in the cocktail (experimental measure, chemical structure, drug target, assay organisms, etc.); on the other hand, there are socioeconomic factors of the specific population (income inequalities, employment levels, fiscal pressure, education, migration, population structure, etc.), which are needed to study the relationship between socioeconomic status and the disease. In this context, machine learning algorithms able to find models for problems with multi-source data have to be used. In this work, the first artificial neural network (ANN) model is proposed for the prediction of HAART cocktails to halt AIDS on epidemic networks of U.S. counties, using information indices that codify both biomolecular and several socioeconomic factors. The data were obtained from at least three major sources. The first dataset included assays of anti-HIV chemical compounds released to ChEMBL. The second dataset is the AIDSVu database of Emory University, which compiled AIDS prevalence for >2300 U.S. counties. The third dataset included socioeconomic data from the U.S. Census Bureau. Three scales or levels were employed to group the counties according to location or population structure codes: state, rural-urban continuum code (RUCC), and urban influence code (UIC). An analysis of >130,000 pairs (network links) was performed, corresponding to AIDS prevalence in 2310 U.S. counties vs. drug cocktails made up of combinations of ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4856 protocols, and 10 possible experimental measures.
The best model found with the original data was a linear neural network (LNN) with AUROC > 0.80 and accuracy, specificity, and sensitivity ≈ 77% in training and external validation series. Changing the spatial and population structure scale (state, UIC, or RUCC codes) did not affect the quality of the model. Imbalance was detected in all the models found when comparing positive/negative cases and linear/non-linear model accuracy ratios. Using the synthetic minority over-sampling technique (SMOTE), data pre-processing, and machine learning algorithms implemented in the WEKA software, more balanced models were found. In particular, a multilayer perceptron (MLP) with AUROC = 97.4% and precision, recall, and F-measure > 90% was found.
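The over-sampling idea behind SMOTE can be illustrated with a minimal sketch. This is not the WEKA implementation used in the study; the function name `smote`, its parameters, and the toy data are assumptions introduced only for illustration. Each synthetic case is placed at a random position on the line segment between a minority-class case and one of its k nearest minority-class neighbors.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points interpolated between minority-class neighbors.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length numeric feature vectors from the under-represented class.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority case as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Rank the other minority cases by squared Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )

        # Choose one of the k nearest neighbors and interpolate toward it.
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy usage: four minority cases with two made-up features each.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(toy_minority, n_synthetic=3, k=2):
    print(point)
```

Because the synthetic cases are interpolated rather than duplicated, the minority class is enlarged without repeating identical records, which is the property that makes the resulting training set more balanced.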