Grupo RNASA-IMEDIR, Department of Computer Science, University of A Coruña, 15071 A Coruña, Spain.
Research Department, Puyo Campus, Universidad Estatal Amazónica, Puyo 160150, Ecuador.
Int J Mol Sci. 2021 Dec 2;22(23):13066. doi: 10.3390/ijms222313066.
Parasite species of the genus Plasmodium cause malaria, which remains a major global health problem because of parasite resistance to available antimalarial drugs and increasing treatment costs. Consequently, computational prediction of new antimalarial compounds with novel targets in the proteome of Plasmodium sp. is a very important goal for the pharmaceutical industry. The success of a preclinical assay can be expected to depend on the conditions of the assay itself, the chemical structure of the drug, the structure of the target protein, and the factors governing the expression of this protein in the proteome, such as gene (deoxyribonucleic acid, DNA) sequence and/or chromosome structure. However, there are no reports of computational models that consider all these factors simultaneously. Difficulties for this kind of analysis include the dispersion of data across different datasets and the high heterogeneity of the data. In this work, we analyzed three databases, ChEMBL (Chemical database of the European Molecular Biology Laboratory), UniProt (Universal Protein Resource), and NCBI-GDV (National Center for Biotechnology Information-Genome Data Viewer), to achieve this goal. The ChEMBL dataset contains outcomes for 17,758 unique assays of potential antimalarial compounds, including numeric descriptors (variables) for the structure of the compounds as well as a large amount of information about the assay conditions. The NCBI-GDV and UniProt datasets include the sequences of genes and proteins and their functions. In addition, we created two partitions (c = c and c = cd) of categorical variables from the ChEMBL dataset. These partitions contain variables that encode information about the experimental conditions of preclinical assays (c) or about the nature and quality of the data (c). These categorical variables include information about 22 parameters of biological activity (c), 28 target proteins (c), and 9 organisms of assay (c), among others.
We also created another partition (c = c) of categorical variables with biological information about the target proteins, genes, and chromosomes. These variables cover 32 genes (c), 10 chromosomes (c), gene orientation (c), and 31 protein functions (c). We used a Perturbation-Theory Machine Learning Information Fusion (IFPTML) algorithm to fuse all this information (from the three databases) and train a predictive model. Shannon's entropy measure Sh (numerical variables) was used to quantify the information about the structure of drugs, protein sequences, gene sequences, and chromosomes on the same information scale. Perturbation Theory Operators (PTOs) in the form of Moving Average (MA) operators were used to quantify perturbations (deviations) in the structural variables with respect to their expected values for different subsets (partitions) of categorical variables. We obtained three IFPTML models using General Discriminant Analysis (GDA), Classification Tree with Univariate Splits (CTUS), and Classification Tree with Linear Combinations (CTLC). The IFPTML-CTLC model showed the best performance, with Sensitivity Sn(%) = 83.6/85.1 and Specificity Sp(%) = 89.8/89.7 for the training/validation sets, respectively. This model could become a useful tool for the optimization of preclinical assays of new antimalarial compounds vs. different proteins in the proteome of Plasmodium.
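As an illustration of the two quantitative ideas in the abstract, here is a minimal Python sketch. It assumes that an MA-type operator can be read as the deviation of a structural variable from its mean over all cases that share the same categorical label, and it computes Sn(%) and Sp(%) from binary labels. The function names, the toy values, and the labels ("IC50", "Ki") are hypothetical and are not taken from the paper's datasets or software.

```python
from collections import defaultdict
from statistics import mean


def moving_average_deviation(values, labels):
    """Return each value minus the mean of all values sharing its label.

    Assumption: this is a simplified reading of an MA-type perturbation
    operator, i.e. deviation from the expected value within a subset.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    group_mean = {label: mean(vs) for label, vs in groups.items()}
    return [value - group_mean[label] for value, label in zip(values, labels)]


def sensitivity_specificity(y_true, y_pred):
    """Return (Sn, Sp) in percent for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


# Toy data (hypothetical): one descriptor measured under two assay labels.
descriptor = [1.0, 3.0, 10.0, 14.0]
assay_label = ["IC50", "IC50", "Ki", "Ki"]
deviations = moving_average_deviation(descriptor, assay_label)

# Toy classification outcome (hypothetical): 1 = active, 0 = inactive.
sn, sp = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

In this reading, a compound's descriptor is judged relative to what is typical for its assay condition, not on an absolute scale, which is what lets a single model cover heterogeneous assays.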