Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers.

Affiliations

Department of Information Science and Engineering, RV College of Engineering, Bangalore, 560059, India.

Department of Computer Science and Engineering, RV College of Engineering, Bangalore, 560059, India.

Publication information

BMC Bioinformatics. 2022 Nov 18;23(1):496. doi: 10.1186/s12859-022-05050-w.

Abstract

Classifying cancer types is an essential step in designing a decision support model for early cancer prediction, and combining machine learning (ML) techniques with ensemble learning is one way to do it. In the present study, various ML algorithms were explored on twenty exome datasets belonging to 5 cancer types. Initially, data clean-up was carried out on 4181 cancer variants with 88 features, and a derivative dataset was obtained using natural language processing and probabilistic distribution. An exploratory analysis of the dataset was then performed using principal component analysis, projecting onto 1D and 2D axes to reduce the high dimensionality of the data. To reduce the imbalance in the derivative dataset, oversampling was carried out using SMOTE. Classification algorithms such as K-nearest neighbour (KNN) and support vector machine (SVM) were first applied to the oversampled dataset. A 4-layer artificial neural network with 1D batch normalization was also designed to improve model accuracy. Ensemble ML techniques such as bagging, with KNN, SVM and MLPs as base classifiers, were then used to improve the weighted average performance metrics of the model. However, the small sample size made further improvement challenging. Therefore, a novel method was employed to augment the sample size using a generative adversarial network (GAN) and a triplet based variational auto encoder (TVAE), which reconstructed the features and labels to generate data. In the initial scrutiny, KNN achieved a weighted average of 0.74 and SVM 0.76. Oversampling significantly improved accuracy on the derivative dataset, and the ensemble classifier raised the accuracy to 82.91% when the data was divided in a 70:15:15 ratio (training, test and holdout datasets). When GAN and TVAE were used to increase the sample size, the overall evaluation metric was 0.92, with an overall comparison model of 0.66. The present study therefore designed an effective model for classifying cancers which, when applied to real-world samples, will play a major role in early cancer diagnosis.
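To make two of the steps named in the abstract concrete, the following is a minimal sketch of SMOTE-style oversampling followed by a 70:15:15 split. It is not the authors' code. The function names `smote_like` and `split_70_15_15`, the toy two-feature data, the neighbour count `k=3`, and the order of the two steps are all assumptions made for illustration, and the sketch uses only the Python standard library rather than any particular ML package.

```python
import math
import random


def split_70_15_15(rows, seed=0):
    """Shuffle rows and cut them into training, test and holdout parts (70:15:15)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises in the cut points.
    n_train = len(shuffled) * 70 // 100
    n_test = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # whatever remains (about 15%)
    return train, test, holdout


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical): 2 features per sample, class 1 is the minority.
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(80)]
minority = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(20)]

# Oversample the minority class until both classes have the same size.
new_points = smote_like(minority, n_new=len(majority) - len(minority))
balanced = [(x, 0) for x in majority] + [(x, 1) for x in minority + new_points]

train, test, holdout = split_70_15_15(balanced)
print(len(train), len(test), len(holdout))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the original minority samples already occupy. In practice, oversampling is often applied to the training part only so that synthetic points do not leak into the evaluation sets; the abstract does not state which order was used.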

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5239/9675216/24852e86fdce/12859_2022_5050_Fig1_HTML.jpg
