Universidad San Francisco de Quito (USFQ), Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas, Av. Interoceánica Km 12 ½ -Cumbayá, Quito 170157, Ecuador; Grupo GINUMED, Corporación Universitaria Rafael Núñez, Facultad de Salud, Programa de Medicina, Cartagena, Colombia; Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Valencia, Spain.
Universidad San Francisco de Quito (USFQ), Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas, Av. Interoceánica Km 12 ½ -Cumbayá, Quito 170157, Ecuador; Universidad San Francisco de Quito (USFQ), Grupo de Química Computacional y Teórica, Departamento de Ingeniería Química, Diego de Robles y vía Interoceánica, Quito, 170157, Pichincha, Ecuador.
J Theor Biol. 2020 Jan 21;485:110039. doi: 10.1016/j.jtbi.2019.110039. Epub 2019 Oct 4.
Novel 3D protein descriptors based on bilinear, quadratic and linear algebraic maps in ℝⁿ are proposed. These descriptors employ the k-th 2-tuple (dis)similarity matrix to codify information related to covalent and non-covalent interactions in these biopolymers. The calculation of inter-amino acid distances is generalized by using several dissimilarity coefficients, to which normalization procedures based on the simple stochastic and mutual probability schemes are applied. A new local-fragment approach based on amino acid types and amino acid groups is proposed to characterize regions of interest in proteins. Topological and geometric macromolecular cutoffs are defined using local and total indices to highlight non-covalent interactions between the side chains of the amino acids. Moreover, the calculation of local and total indices is generalized following a LEGO approach, using several aggregation operators. Collinearity and variability analyses are performed to evaluate every generalizing component applied to the definition of these novel indices. These experiments are aimed at reducing the number of molecular descriptors (MDs) used to build predictive models. The predictive power of the proposed indices was evaluated using two benchmark datasets: folding rate and secondary structural classification of proteins. The proposed MDs are modeled using Multiple Linear Regression (MLR) for the folding rate and Support Vector Machine (SVM) for the structural classification. The best regression model developed for the folding rate of proteins yields a cross-validation coefficient of 0.875 (Test Set), and the best model developed for secondary structural classification correctly classifies 98% of instances (Test Set). These statistical parameters are superior to those obtained with existing MDs reported in the literature.
Overall, the new theoretical generalization enhanced the information extracted by the MDs, yielding a better correlation between the proposed indices and the two benchmark datasets evaluated. The optimal theoretical configurations defined for the calculation of these MDs were selected for low collinearity and low information redundancy among them. These theoretical configurations and the software are available at http://tomocomd.com/mulims-mcompas.
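To make the algebraic forms named in the abstract more concrete, the following is a minimal sketch, not the authors' software. It assumes Euclidean distance as the dissimilarity coefficient, a row-wise "simple stochastic" normalization (each row divided by its sum), and a linear form obtained by pairing the property vector with a vector of ones. The function names, coordinates and property values are hypothetical and chosen only for illustration.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates
    (assumed dissimilarity coefficient; the paper generalizes to several)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def simple_stochastic(matrix):
    """Row-normalize a matrix so each non-empty row sums to 1
    (assumed reading of the 'simple stochastic' scheme)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def bilinear_index(x, matrix, y):
    """Bilinear form x^T M y over two amino acid property vectors."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * y[j] for i in range(n) for j in range(n))

def quadratic_index(x, matrix):
    """Quadratic form x^T M x: the bilinear form with both vectors equal."""
    return bilinear_index(x, matrix, x)

def linear_index(x, matrix):
    """Linear form 1^T M x: the bilinear form with a vector of ones (assumption)."""
    return bilinear_index([1.0] * len(x), matrix, x)

# Hypothetical three-residue fragment: coordinates and one property per residue.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
prop = [1.0, 2.0, 3.0]

raw = distance_matrix(coords)
norm = simple_stochastic(raw)

total_quadratic = quadratic_index(prop, raw)
total_linear = linear_index(prop, raw)
normalized_quadratic = quadratic_index(prop, norm)
```

In this sketch a total index sums over all residue pairs; a local (fragment) index could be obtained by zeroing the property entries of residues outside the amino acid type or group of interest before evaluating the same form.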