An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples.

Authors

Bose Shilpi, Das Chandra, Banerjee Abhik, Ghosh Kuntal, Chattopadhyay Matangini, Chattopadhyay Samiran, Barik Aishwarya

Affiliations

Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India.

Machine Intelligence Unit & Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal, India.

Publication information

PeerJ Comput Sci. 2021 Sep 16;7:e671. doi: 10.7717/peerj-cs.671. eCollection 2021.

DOI:10.7717/peerj-cs.671

PMID:34616883

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8459790/

Abstract

BACKGROUND

Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Because of this capability, machine learning techniques are widely used in medical applications, especially those involving large-scale genomic and proteomic data. Cancer classification based on bio-molecular profiling data is an important topic in medical applications because it improves the diagnostic accuracy of cancer and supports successful treatment. Hence, machine learning techniques are widely used in cancer detection and prognosis.

METHODS

This article proposes a new ensemble machine learning classification model, the Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC), which can handle both the class imbalance problem and the high dimensionality of microarray datasets. The model first generates a number of bootstrapped datasets from the original training data and applies an oversampling procedure to them to handle class imbalance. The proposed MFSAC method is then applied to each bootstrapped dataset to generate a sub-dataset that contains a subset of the most relevant (informative) attributes of the original dataset. MFSAC is a feature selection technique that combines multiple filters with a new supervised attribute clustering algorithm. A base classifier is then constructed separately for every sub-dataset, and the predictions of these base classifiers are combined by majority voting to form the MFSAC-based ensemble classifier. In addition, the most informative attributes are selected as important features based on how frequently they occur in these sub-datasets.
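The pipeline described above (bootstrap, oversample, select attributes, train one base classifier per sub-dataset, vote, count attribute frequency) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the MFSAC filters or the clustering step, so `filter_score` is a simple stand-in relevance score, the nearest-centroid base classifier is a placeholder, and all function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def bootstrap_with_oversampling(X, y, rng):
    """Bootstrap the rows, then duplicate minority rows until classes are balanced."""
    n = len(X)
    drawn = [rng.randrange(n) for _ in range(n)]
    by_class = {label: [i for i in drawn if y[i] == label] for label in sorted(set(y))}
    for label, rows in by_class.items():
        if not rows:  # a class missed by the bootstrap gets one random member
            rows.append(rng.choice([i for i in range(n) if y[i] == label]))
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced += rows + [rng.choice(rows) for _ in range(target - len(rows))]
    return [X[i] for i in balanced], [y[i] for i in balanced]


def filter_score(column, y):
    """Stand-in filter (NOT MFSAC): spread of class means over mean within-class spread."""
    groups = {}
    for value, label in zip(column, y):
        groups.setdefault(label, []).append(value)
    class_means = [mean(g) for g in groups.values()]
    within = mean([pstdev(g) for g in groups.values()])
    return pstdev(class_means) / (within + 1e-9)


def fit_nearest_centroid(X, y, features):
    """Placeholder base classifier: one centroid per class on the selected attributes."""
    rows_by_class = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append([row[f] for f in features])
    return {label: [mean(col) for col in zip(*rows)] for label, rows in rows_by_class.items()}


def predict_one(centroids, features, row):
    point = [row[f] for f in features]
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))


def fit_ensemble(X, y, n_models=15, k=2, seed=0):
    """Train one base classifier per bootstrapped sub-dataset and count attribute usage."""
    rng = random.Random(seed)
    members, usage = [], Counter()
    for _ in range(n_models):
        Xb, yb = bootstrap_with_oversampling(X, y, rng)
        columns = list(zip(*Xb))
        ranked = sorted(range(len(columns)), key=lambda j: filter_score(columns[j], yb), reverse=True)
        features = ranked[:k]          # keep the k top-scoring attributes
        usage.update(features)         # frequency of occurrence across sub-datasets
        members.append((features, fit_nearest_centroid(Xb, yb, features)))
    return members, usage


def predict(members, row):
    """Majority vote over the base classifiers' predicted labels."""
    votes = Counter(predict_one(centroids, features, row) for features, centroids in members)
    return votes.most_common(1)[0][0]


# Invented, imbalanced toy data: attributes 0 and 2 separate the classes, 1 and 3 are noise.
X = [
    [1.00, 0.1, 2.00, 0.9], [1.02, 0.3, 2.03, 0.7], [0.98, 0.5, 1.97, 0.5],
    [1.01, 0.7, 2.01, 0.3], [0.99, 0.9, 1.99, 0.1], [1.03, 0.2, 2.02, 0.6],
    [0.97, 0.8, 1.98, 0.2],
    [5.00, 0.4, 8.00, 0.8], [5.02, 0.6, 7.97, 0.4], [4.98, 0.15, 8.03, 0.6],
]
y = ["normal"] * 7 + ["tumor"] * 3

members, usage = fit_ensemble(X, y)
print(sorted(usage))                              # attributes selected at least once
print(predict(members, [5.1, 0.5, 7.9, 0.4]))     # tumor
print(predict(members, [1.0, 0.5, 2.1, 0.4]))     # normal
```

In this sketch the `usage` counter plays the role of the frequency-based attribute ranking: attributes that are selected in many sub-datasets are the candidates for important features. Swapping in a different filter or base classifier only requires replacing `filter_score` or `fit_nearest_centroid`.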

RESULTS

To assess the performance of the proposed MFSAC-EC model, it was applied to several high-dimensional microarray gene expression datasets for cancer sample classification and compared with well-known existing models. The experimental results show that the generalization performance (testing accuracy) of the proposed classifier is significantly better than that of the other models. The proposed model can also identify many important attributes (biomarker genes).
