Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research.

Author information

Peterson Leif E, Coleman Matthew A

Affiliation

Baylor College of Medicine, Houston, Texas 77030 USA.

Publication information

Int J Approx Reason. 2008 Jan;47(1):17-36. doi: 10.1016/j.ijar.2007.03.006.

DOI:10.1016/j.ijar.2007.03.006

PMID:19079753

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2600874/

Abstract

Receiver operating characteristic (ROC) curves were generated to obtain classification area under the curve (AUC) as a function of feature standardization, fuzzification, and sample size from nine large sets of cancer-related DNA microarrays. Classifiers used included k nearest neighbor (kNN), naïve Bayes classifier (NBC), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), learning vector quantization (LVQ1), logistic regression (LOG), polytomous logistic regression (PLOG), artificial neural networks (ANN), particle swarm optimization (PSO), constricted particle swarm optimization (CPSO), kernel regression (RBF), radial basis function networks (RBFN), gradient descent support vector machines (SVMGD), and least squares support vector machines (SVMLS). For each data set, AUC was determined for a number of combinations of sample size and total sum[-log(p)] of feature t-tests, with and without feature standardization, and with (fuzzy) and without (crisp) fuzzification of features. Altogether, 2,123,530 classification runs were made. At the greatest level of sample size, ANN resulted in a fitted AUC of 90%, while PSO resulted in the lowest fitted AUC of 72.1%. AUC values derived from 4NN were the most dependent on sample size, while those from PSO were the least dependent. ANN depended the most on the total statistical significance of the features used, based on sum[-log(p)], whereas PSO was the least dependent. Standardization of features increased AUC by 8.1% for PSO and decreased it by 0.2% for QDA, while fuzzification increased AUC by 9.4% for PSO and reduced AUC by 3.8% for QDA. AUC determination in planned microarray experiments without standardization and fuzzification of features will benefit the most if CPSO is used for lower levels of feature significance (i.e., sum[-log(p)] ~ 50) and ANN is used for greater levels of significance (i.e., sum[-log(p)] ~ 500).
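To make the two feature-level quantities above concrete, here is a minimal Python sketch of feature standardization and of the sum[-log(p)] score. It rests on assumptions that the abstract does not confirm: standardization is taken to be a z-score with the sample standard deviation, the logarithm is taken to be natural, and the p-values are supplied by the caller from per-feature t-tests computed elsewhere. The function names `standardize` and `total_significance` are hypothetical.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Z-score one feature (assumed form of standardization):
    subtract the mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def total_significance(p_values):
    """Return sum[-log(p)] over the selected features.

    The natural logarithm is an assumption; the abstract does not
    state the base. p-values must lie in (0, 1].
    """
    total = 0.0
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        total += -math.log(p)
    return total
```

Under the natural-log assumption, ten features that each have p = 0.01 contribute about 46 in total, which is close to the lower significance level (about 50) mentioned above.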
When only standardization of features is performed, studies are likely to benefit most by using CPSO for low levels of feature statistical significance and LVQ1 for greater levels of significance. Studies involving only fuzzification of features should employ LVQ1 because of the substantial gain in AUC observed and low expense of LVQ1. Lastly, PSO resulted in significantly greater levels of AUC (89.5% average) when feature standardization and fuzzification were performed. In consideration of the data sets used and factors influencing AUC which were investigated, if low-expense computation is desired then LVQ1 is recommended. However, if computational expense is of less concern, then PSO or CPSO is recommended.
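As an illustration of the AUC measure used throughout the abstract, the following Python sketch builds an empirical ROC curve from classifier scores and integrates it with the trapezoidal rule. It is a generic textbook construction for a two-class problem, not necessarily the procedure the authors used; the function names `roc_points` and `auc` and the 0/1 label encoding are assumptions made for this example.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels: 1 for the positive class, 0 for the negative class.
    scores: higher values mean "more likely positive".
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For example, `auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))` gives 0.75, which matches the fraction of positive-negative pairs in which the positive sample receives the higher score (3 of 4).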
