Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Mariscal Sucre Avenue, Quito, 170129, Ecuador.
RNASA-IMEDIR, Computer Science Faculty, University of Coruna, Coruna, 15071, Spain.
Sci Rep. 2020 May 22;10(1):8515. doi: 10.1038/s41598-020-65584-y.
Breast cancer (BC) is a heterogeneous disease involving genomic alterations, protein expression deregulation, signaling pathway alterations, hormone disruption, ethnicity and environmental determinants. Because of this complexity, predicting the proteins involved in BC is an active topic in drug design. This work proposes an accurate classifier for BC proteins built from six sets of protein sequence descriptors and 13 machine-learning methods. After univariate feature selection on a mix of five descriptor families, the best classifier was obtained with a multilayer perceptron (an artificial neural network) and 300 features. The model reached an area under the receiver operating characteristic curve (AUROC) of 0.980 ± 0.0037 and an accuracy of 0.936 ± 0.0056 (3-fold cross-validation). When the model was used to score 4,504 cancer-associated proteins, the best ranked cancer immunotherapy proteins related to BC were RPS27, SUPT4H1, CLPSL2, POLR2K, RPL38, AKT3, CDK3, RPS20, RASL11A and UBTD1; the best ranked metastasis driver proteins related to BC were S100A9, DDA1, TXN, PRNP, RPS27, S100A14, S100A7, MAPK1, AGR3 and NDUFA13; and the best ranked RNA-binding proteins related to BC were S100A9, TXN, RPS27L, RPS27, RPS27A, RPL38, MRPL54, PPAN, RPS20 and CSRP1. The model thus predicts several BC-related proteins that merit deeper study as candidate biomarkers and therapeutic targets. Scripts can be downloaded at https://github.com/muntisa/neural-networks-for-breast-cancer-proteins.
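To make the reported metric concrete, the following is a minimal sketch of how an AUROC summarized as "mean ± spread" over 3 folds can be computed. It is not the authors' code. It assumes the ± denotes the standard deviation across folds, which the abstract does not state. The function names (`auroc`, `stratified_folds`, `summarize_cv`) and the synthetic labels and scores are hypothetical; the scores stand in for a classifier's held-out predictions.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive gets the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=3, seed=0):
    """Split indices into k folds, dealing each class out round-robin so
    every fold keeps roughly the overall class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def summarize_cv(labels, scores, k=3, seed=0):
    """Return (mean, standard deviation, per-fold list) of the AUROC.
    Assumption: `scores` are held-out predictions for every sample."""
    per_fold = []
    for fold in stratified_folds(labels, k, seed):
        fold_labels = [labels[i] for i in fold]
        fold_scores = [scores[i] for i in fold]
        per_fold.append(auroc(fold_labels, fold_scores))
    return statistics.mean(per_fold), statistics.stdev(per_fold), per_fold


if __name__ == "__main__":
    # Synthetic stand-in data: 60 positives scored around 1.0 and
    # 60 negatives scored around 0.0 (hypothetical, not from the paper).
    rng = random.Random(42)
    labels = [1] * 60 + [0] * 60
    scores = [rng.gauss(1.0, 1.0) for _ in range(60)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(60)]
    mean, sd, per_fold = summarize_cv(labels, scores, k=3)
    print(f"AUROC = {mean:.3f} ± {sd:.4f} over {len(per_fold)} folds")
```

In an actual cross-validation run, the scores for each fold would come from a model trained on the other two folds; the sketch only covers the scoring and summary step.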