Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data.

Affiliations

School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China.

Department of Infectious Disease, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication information

Lab Invest. 2021 Apr;101(4):430-441. doi: 10.1038/s41374-020-00525-x. Epub 2021 Feb 11.

DOI:10.1038/s41374-020-00525-x

PMID:33574440

Abstract

Most biomedical datasets, including those from 'omics, population studies, and surveys, are rectangular in shape and have few missing data. Recently, their sample sizes have grown significantly. Rigorous analyses of these large datasets demand considerably more efficient and more accurate algorithms. Machine learning (ML) algorithms, including random forests (RF), decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM), have been used to classify outcomes in biomedical datasets. However, their performance and efficiency in classifying multi-category outcomes of rectangular data are poorly understood. Therefore, we compared these metrics among the four ML algorithms. As an example, we created a large rectangular dataset from the female breast cancer cases in the Surveillance, Epidemiology, and End Results-18 (SEER-18) database that were diagnosed in 2004 and followed up until December 2016. The outcome was the five-category cause of death, namely alive, non-breast cancer, breast cancer, cardiovascular disease, and other cause. We analyzed the 54 dichotomized features from ~45,000 patients using MATLAB (version 2018a) and tenfold cross-validation. The accuracy in classifying the five-category cause of death with DT, RF, ANN, and SVM was 69.21%, 70.23%, 70.16%, and 69.06%, respectively, each higher than the 68.12% accuracy of multinomial logistic regression. Based on the features' information entropy, we optimized dimension reduction (i.e., reducing the number of features in the models). We found that 32 or more features were required to maintain similar accuracy, while the running time decreased from 55.57 s for 54 features to 25.99 s for 32 features in RF, from 12.92 s to 10.48 s in ANN, and from 175.50 s to 67.81 s in SVM. In summary, we show that RF, DT, ANN, and SVM had similar accuracy for classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will increase the model's efficiency while maintaining classification accuracy.
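The abstract does not spell out how features were ranked, so the following is only a minimal sketch of one common reading of "dimension reduction based on information gain": compute the information gain of each dichotomized feature with respect to the multi-category outcome, then keep the top-ranked features. It is written in Python rather than the MATLAB used in the study, and the function names (entropy, information_gain, rank_features), the feature names, and the toy data are all hypothetical illustrations, not the authors' code or data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature names sorted from highest to lowest information gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy, made-up data: six patients, three outcome categories, two binary features.
labels = ["alive", "alive", "breast", "breast", "other", "other"]
columns = {
    "f_informative": [0, 0, 1, 1, 1, 1],  # separates "alive" from the rest
    "f_noise": [0, 1, 0, 1, 0, 1],        # unrelated to the outcome
}

ranking = rank_features(columns, labels)
top_k = ranking[:1]  # keep the k highest-gain features (k = 1 here)
```

In a workflow like the one described, the reduced feature set would then be passed to each classifier and re-evaluated with cross-validation, so that accuracy and running time can be compared against the full 54-feature model.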

Similar articles

Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data.

Lab Invest. 2021 Apr;101(4):430-441. doi: 10.1038/s41374-020-00525-x. Epub 2021 Feb 11.

Diagnosis of urinary tract infection based on artificial intelligence methods.

Comput Methods Programs Biomed. 2018 Nov;166:51-59. doi: 10.1016/j.cmpb.2018.10.007. Epub 2018 Oct 2.

Diagnostic Accuracy of Different Machine Learning Algorithms for Breast Cancer Risk Calculation: a Meta-Analysis.

Asian Pac J Cancer Prev. 2018 Jul 27;19(7):1747-1752. doi: 10.22034/APJCP.2018.19.7.1747.

Machine learning in medicine: a practical introduction.

BMC Med Res Methodol. 2019 Mar 19;19(1):64. doi: 10.1186/s12874-019-0681-4.

The BCPM method: decoding breast cancer with machine learning.

BMC Med Imaging. 2024 Sep 17;24(1):248. doi: 10.1186/s12880-024-01402-5.

Machine learning in the classification of asian rust severity in soybean using hyperspectral sensor.

Spectrochim Acta A Mol Biomol Spectrosc. 2024 May 15;313:124113. doi: 10.1016/j.saa.2024.124113. Epub 2024 Mar 4.

Comparison of Supervised Machine Learning Algorithms for Classifying of Home Discharge Possibility in Convalescent Stroke Patients: A Secondary Analysis.

J Stroke Cerebrovasc Dis. 2021 Oct;30(10):106011. doi: 10.1016/j.jstrokecerebrovasdis.2021.106011. Epub 2021 Jul 26.

Comparison of Classification Success Rates of Different Machine Learning Algorithms in the Diagnosis of Breast Cancer.

Asian Pac J Cancer Prev. 2022 Oct 1;23(10):3287-3297. doi: 10.31557/APJCP.2022.23.10.3287.

Machine learning models in breast cancer survival prediction.

Technol Health Care. 2016;24(1):31-42. doi: 10.3233/THC-151071.

Can machine learning predict pharmacotherapy outcomes? An application study in osteoporosis.

Comput Methods Programs Biomed. 2022 Oct;225:107028. doi: 10.1016/j.cmpb.2022.107028. Epub 2022 Jul 21.

Cited by

Towards machine learning fairness in classifying multicategory causes of deaths in colorectal or lung cancer patients.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf398.

Role of immature granulocyte and blood biomarkers in predicting perforated acute appendicitis using machine learning model.

World J Clin Cases. 2025 Aug 6;13(22):104379. doi: 10.12998/wjcc.v13.i22.104379.

Integrative analysis of multi-omics data and gut microbiota composition reveals prognostic subtypes and predicts immunotherapy response in colorectal cancer using machine learning.

Sci Rep. 2025 Jul 12;15(1):25268. doi: 10.1038/s41598-025-08915-1.

Normalization and Selecting Non-Differentially Expressed Genes Improve Machine Learning Modelling of Cross-Platform Transcriptomic Data.

Trans Artif Intell. 2025;1(1). doi: 10.53941/tai.2025.100005. Epub 2025 May 25.

Classification of biomedical lung cancer images using optimized binary bat technique by constructing oblique decision trees.

Sci Rep. 2025 May 29;15(1):18954. doi: 10.1038/s41598-025-02954-4.

Applications of machine learning approaches for pediatric asthma exacerbation management: a systematic review.

BMC Med Inform Decis Mak. 2025 Apr 18;25(1):170. doi: 10.1186/s12911-025-02990-0.

Towards machine learning fairness in classifying multicategory causes of deaths in colorectal or lung cancer patients.

bioRxiv. 2025 Feb 19:2025.02.14.638368. doi: 10.1101/2025.02.14.638368.

Development of a novel calculator to predict gonadotropin dose and oocyte yield in oocyte cryopreservation cycles.

J Assist Reprod Genet. 2025 Feb;42(2):423-432. doi: 10.1007/s10815-024-03372-7. Epub 2025 Jan 7.

A simplified approach for efficiency analysis of machine learning algorithms.

PeerJ Comput Sci. 2024 Nov 28;10:e2418. doi: 10.7717/peerj-cs.2418. eCollection 2024.

Application of Isokinetic Dynamometry Data in Predicting Gait Deviation Index Using Machine Learning in Stroke Patients: A Cross-Sectional Study.

Sensors (Basel). 2024 Nov 13;24(22):7258. doi: 10.3390/s24227258.

References

Predicting long-term multicategory cause of death in patients with prostate cancer: random forest versus multinomial model.

Am J Cancer Res. 2020 May 1;10(5):1344-1355. eCollection 2020.
