Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers.

Affiliations

Department of Information Science and Engineering, RV College of Engineering, Bangalore, 560059, India.

Department of Computer Science and Engineering, RV College of Engineering, Bangalore, 560059, India.

Publication information

BMC Bioinformatics. 2022 Nov 18;23(1):496. doi: 10.1186/s12859-022-05050-w.

DOI:10.1186/s12859-022-05050-w

PMID:36401182

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9675216/

Abstract

Classifying cancer types is an essential step in designing a decision support model for early cancer prediction, and combining machine learning (ML) techniques with ensemble learning is one way to do it. In the present study, various ML algorithms were explored on twenty exome datasets belonging to 5 cancer types. Initially, data clean-up was carried out on 4181 cancer variants with 88 features, and a derivative dataset was obtained using natural language processing and probabilistic distribution. An exploratory analysis using principal component analysis (PCA) was then performed along one and two dimensions to reduce the high dimensionality of the data. To significantly reduce the class imbalance in the derivative dataset, oversampling was carried out using SMOTE. K-nearest neighbour (KNN) and support vector machine (SVM) classifiers were first applied to the oversampled dataset. A 4-layer artificial neural network with 1D batch normalization was also designed to improve model accuracy. Ensemble ML techniques such as bagging, with KNN, SVM and multilayer perceptrons (MLPs) as base classifiers, were then used to improve the weighted average performance metrics of the model. However, due to the small sample size, model improvement was challenging. Therefore, a novel method was employed to augment the sample size using a generative adversarial network (GAN) and a triplet based variational auto encoder (TVAE), which reconstructed the features and labels to generate data. In the initial scrutiny, KNN showed a weighted average of 0.74 and SVM 0.76. Oversampling significantly improved accuracy on the derivative dataset, and the ensemble classifier raised the accuracy to 82.91% when the data was divided in a 70:15:15 ratio (training, test and holdout datasets). With the sample size increased by GAN and TVAE, the overall evaluation metric value was 0.92, with an overall comparison model of 0.66. The present study therefore designed an effective model for classifying cancers which, when applied to real-world samples, will play a major role in early cancer diagnosis.
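
Two of the steps named in the abstract, the 70:15:15 split and SMOTE-style oversampling, can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names `split_70_15_15` and `smote_like`, the toy two-feature data, the choice of `k=3`, and the decision to oversample only the training part are all assumptions made for illustration. The interpolation step follows the basic SMOTE idea (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours) in simplified form.

```python
import math
import random


def split_70_15_15(rows, seed=0):
    """Shuffle rows and split them into training, test and holdout parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(shuffled) * 70 // 100
    n_test = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # remainder, about 15%
    return train, test, holdout


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base within the minority class (base excluded).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): (features, label) pairs; class "B" is the minority.
toy_rng = random.Random(42)
data = [([toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)], "A") for _ in range(80)]
data += [([toy_rng.gauss(3, 1), toy_rng.gauss(3, 1)], "B") for _ in range(20)]

train, test, holdout = split_70_15_15(data, seed=1)

# Assumption: oversample only the training part so that the test and
# holdout parts contain no synthetic samples.
minority = [x for x, y in train if y == "B"]
majority_count = sum(1 for _, y in train if y == "A")
new_points = smote_like(minority, majority_count - len(minority))
balanced_train = train + [(x, "B") for x in new_points]
```

In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred over a hand-written version. The abstract does not say whether oversampling was applied before or after the split, so the order used here is only one reasonable choice.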

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5239/9675216/24852e86fdce/12859_2022_5050_Fig1_HTML.jpg
