Discovery of potential biomarkers for lung cancer classification based on human proteome microarrays using Stochastic Gradient Boosting approach.

Affiliations

Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China.

Chongqing Center for Disease Control and Prevention, No.8 Changjiang 2nd Street, Yuzhong District, Chongqing, 400042, China.

Publication information

J Cancer Res Clin Oncol. 2023 Aug;149(10):6803-6812. doi: 10.1007/s00432-023-04643-z. Epub 2023 Feb 18.

Abstract

PURPOSE

Early identification of lung cancer (LC) would considerably facilitate its prevention and intervention. Human proteome microarrays can serve as a "liquid biopsy" that complements conventional LC diagnosis, but analyzing them requires advanced bioinformatics methods such as feature selection (FS) and refined machine learning models.

METHODS

A two-stage FS methodology that combines Pearson's correlation (PC) with either a univariate filter (SBF) or recursive feature elimination (RFE) was used to reduce redundancy in the original dataset. Stochastic Gradient Boosting (SGB), Random Forest (RF), and Support Vector Machine (SVM) techniques were applied to build ensemble classifiers based on four feature subsets. The synthetic minority oversampling technique (SMOTE) was used to preprocess the imbalanced data.
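The abstract does not say how the correlation stage or SMOTE were configured, so the following is only a minimal sketch under stated assumptions. It shows one common way to use Pearson's correlation as a redundancy filter (greedily dropping a feature that is highly correlated with one already kept) and the interpolation step at the core of SMOTE. The function names, the 0.9 cutoff, and the toy protein values are hypothetical illustrations, not the authors' settings. The second FS stage (SBF or RFE) and the classifiers are not shown.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Greedy redundancy filter: keep a feature only if its absolute
    correlation with every already-kept feature is below `threshold`.

    `features` maps a feature name to its values across samples.
    The 0.9 cutoff is an illustrative assumption, not the paper's value.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def smote_sample(point, neighbor, rng):
    """One SMOTE-style synthetic sample: a random point on the segment
    between a minority-class sample and one of its minority neighbors."""
    gap = rng.random()
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


# Hypothetical toy intensities for three proteins across four samples.
features = {
    "protein_a": [1.0, 2.0, 3.0, 4.0],
    "protein_b": [2.1, 4.0, 6.2, 8.1],  # nearly proportional to protein_a
    "protein_c": [4.0, 1.0, 3.0, 2.0],
}
kept = drop_correlated(features)

# Interpolate between two hypothetical minority-class samples.
synthetic = smote_sample([1.0, 4.0], [3.0, 2.0], random.Random(0))
```

Here `protein_b` is dropped because it is almost perfectly correlated with `protein_a`, while `protein_c` is kept. In a full pipeline, oversampling would normally be applied to the training split only, so that synthetic samples do not leak into the test set.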

RESULTS

The SBF-based and RFE-based FS approaches extracted 25 and 55 features, respectively, 14 of which overlapped. All three ensemble models achieved high accuracy (0.867 to 0.967) and sensitivity (0.917 to 1.00) on the test datasets, with SGB on the SBF subset outperforming the others. SMOTE improved model performance during training. Three of the top selected candidate biomarkers (LGR4, CDC34, and GHRHR) are strongly suggested to play a role in lung tumorigenesis.
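For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from a binary confusion matrix. The labels and predictions are made-up illustrative values, not the study's data, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) for binary labels.

    Assumes both classes occur in `y_true`; otherwise a denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 4 cancer cases (1) and 6 controls (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # one control misclassified
metrics = classification_metrics(y_true, y_pred)
```

In this toy case every cancer case is detected, so sensitivity is 1.0, while the single false positive lowers accuracy to 0.9 and specificity to 5/6.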

CONCLUSION

A novel hybrid FS method combined with classical ensemble machine learning algorithms was applied for the first time to the classification of protein microarray data. The parsimonious model constructed by the SGB algorithm with appropriate FS and SMOTE performs well in the classification task, with high sensitivity and specificity. The standardization and innovation of bioinformatics approaches for protein microarray analysis require further exploration and validation.
