Development of Symbolic Expressions Ensemble for Breast Cancer Type Classification Using Genetic Programming Symbolic Classifier and Decision Tree Classifier.

Author information

Anđelić Nikola, Baressi Šegota Sandi

Affiliation

Department of Automation and Electronics, Faculty of Engineering, University of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia.

Publication information

Cancers (Basel). 2023 Jun 29;15(13):3411. doi: 10.3390/cancers15133411.

Abstract

Breast cancer is a type of cancer with several sub-types. It occurs when cells in breast tissue grow out of control. Accurate sub-type classification for a patient diagnosed with breast cancer is essential for selecting the proper treatment. Breast cancer classification based on gene expression is challenging even for artificial intelligence (AI) because of the large number of gene expressions. The idea in this paper is to apply the genetic programming symbolic classifier (GPSC) to a publicly available dataset to obtain a set of symbolic expressions (SEs) that can classify the breast cancer sub-type from gene expressions with high classification accuracy. The dataset poses three initial problems: a large number of input variables (54,676 gene expressions), a small number of samples (151), and six breast cancer sub-type classes that are highly imbalanced. The large number of input variables is addressed with principal component analysis (PCA), while the small number of samples and the large class imbalance are addressed by applying different oversampling methods, which generate different dataset variations. On each oversampled dataset, the GPSC with the random hyperparameter values search (RHVS) method is trained using 5-fold cross-validation (5CV) to obtain a set of SEs. The best set of SEs is chosen based on the mean values of accuracy (ACC), area under the receiver operating characteristic curve (AUC), precision, recall, and F1-score. In this case, the highest classification accuracy equals 0.992 across all evaluation metrics. The best set of SEs is additionally combined with a decision tree classifier, which slightly improves ACC to 0.994.
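
To make the idea of classifying with a set of symbolic expressions more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a one-vs-rest arrangement in which each class has its own expression, each expression output is passed through a sigmoid, and the class with the highest score is chosen. The three expressions, the class names, the two-component inputs, and the use of macro averaging for precision, recall, and F1-score are all hypothetical choices made for illustration; none of them are taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping a raw score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical symbolic expressions, one per class (one-vs-rest assumption).
# x holds principal-component values; these formulas are NOT from the paper.
EXPRESSIONS = {
    "class_a": lambda x: x[0] - x[1],
    "class_b": lambda x: x[1] - 0.5 * x[0],
    "class_c": lambda x: 1.0 - abs(x[0]) - abs(x[1]),
}


def predict(x):
    """Return the class whose expression gives the highest sigmoid score."""
    scores = {label: sigmoid(expr(x)) for label, expr in EXPRESSIONS.items()}
    return max(scores, key=scores.get)


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "acc": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy samples (made up): two principal components per sample.
X = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [3.0, 1.0], [1.0, 1.0]]
y_true = ["class_a", "class_b", "class_c", "class_a", "class_c"]

y_pred = [predict(x) for x in X]
metrics = macro_metrics(y_true, y_pred)
```

In this toy data the last sample is deliberately misclassified, so the metrics differ from one another. A study along the lines of the abstract would compute such metrics on each cross-validation fold and average them before comparing candidate sets of expressions.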

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/533f/10340251/60929e60a0cb/cancers-15-03411-g001.jpg
