School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710129, China.
J Proteomics. 2023 May 30;280:104895. doi: 10.1016/j.jprot.2023.104895. Epub 2023 Apr 5.
The Cancer Proteome Atlas (TCPA) project collects reverse-phase protein array (RPPA)-based proteome datasets from nearly 8000 samples across 32 cancer types. This study aims to investigate the pan-cancer proteome signature and to identify cancer subtypes of glioma, kidney cancer, and lung cancer based on TCPA data. We first visualized the tumor clustering models using t-distributed stochastic neighbour embedding (t-SNE) and a bi-clustering heatmap. Then, three feature selection methods (pyHSICLasso, XGBoost, and Random Forest) were applied to select protein features for classifying cancer subtypes in the training dataset, and the LibSVM algorithm was employed to test classification accuracy in the validation dataset. Clustering analysis revealed that different kinds of tumors have relatively distinct proteomic profiles based on tissue of origin. We identified 20, 10, and 20 protein features with the highest accuracies in classifying subtypes of glioma, kidney cancer, and lung cancer, respectively. The predictive abilities of the selected proteins were confirmed by receiver operating characteristic (ROC) analysis. Finally, a Bayesian network was used to explore the protein biomarkers that have direct causal relationships with cancer subtypes. Overall, we highlight the theoretical and technical applications of machine learning-based feature selection approaches in the analysis of high-throughput biological data, particularly for cancer biomarker research.
SIGNIFICANCE: Functional proteomics is a powerful approach for characterizing cell signaling pathways and understanding their phenotypic effects on cancer development. The TCPA database provides a platform to explore and analyze TCGA pan-cancer RPPA-based protein expression. With the advent of RPPA technology, the availability of high-throughput data in the TCPA platform has made it possible to use machine learning methods to identify protein biomarkers and further differentiate cancer subtypes based on proteomic data. In this study, we highlight the role of feature selection and Bayesian networks in discovering protein biomarkers for classifying cancer subtypes based on functional proteomic data. The application of machine learning methods to the analysis of high-throughput biological data, particularly in cancer biomarker research, has potential clinical value for developing individualized treatment strategies.
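The ROC analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary comparison between two subtypes and one classifier score per sample, and it computes the area under the ROC curve through the equivalent Mann-Whitney pairwise comparison. The function name `roc_auc` and the toy values are illustrative assumptions; they are not taken from the study or from TCPA data.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve for a binary problem.

    labels: 1 for the positive subtype, 0 for the other subtype.
    The AUC equals the fraction of (positive, negative) pairs in which
    the positive sample receives the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six samples (not real data).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two subtypes. For more than two subtypes, a one-vs-rest comparison per subtype is one common way to reuse the same calculation.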