Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data.

Affiliations

School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China.

Department of Infectious Disease, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication information

Lab Invest. 2021 Apr;101(4):430-441. doi: 10.1038/s41374-020-00525-x. Epub 2021 Feb 11.

Abstract

Most biomedical datasets, including those of 'omics, population studies, and surveys, are rectangular in shape and have few missing values. Recently, their sample sizes have grown significantly. Rigorous analyses of these large datasets demand considerably more efficient and more accurate algorithms. Machine learning (ML) algorithms, including random forests (RF), decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM), have been used to classify outcomes in biomedical datasets. However, their performance and efficiency in classifying multi-category outcomes of rectangular data are poorly understood. Therefore, we compared these metrics across the four ML algorithms. As an example, we created a large rectangular dataset from the female breast cancer cases in the Surveillance, Epidemiology, and End Results-18 (SEER-18) database that were diagnosed in 2004 and followed up until December 2016. The outcome was the five-category cause of death, namely alive, non-breast cancer, breast cancer, cardiovascular disease, and other cause. We analyzed the 54 dichotomized features from ~45,000 patients using MATLAB (version 2018a) and tenfold cross-validation. The accuracy in classifying the five-category cause of death with DT, RF, ANN, and SVM was 69.21%, 70.23%, 70.16%, and 69.06%, respectively, which was higher than the accuracy of 68.12% with multinomial logistic regression. Based on the features' information entropy, we optimized dimension reduction (i.e., reducing the number of features in the models). We found that 32 or more features were required to maintain similar accuracy, while the running time decreased from 55.57 s for 54 features to 25.99 s for 32 features in RF, from 12.92 s to 10.48 s in ANN, and from 175.50 s to 67.81 s in SVM. In summary, we show that RF, DT, ANN, and SVM had similar accuracy for classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will increase the model's efficiency while maintaining classification accuracy.
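
The abstract does not spell out how the features were ranked, so the following is only a minimal Python sketch of one common reading of "dimension reduction based on information gain": compute the information gain H(Y) - H(Y | X) of each dichotomized feature X with respect to the outcome Y, then keep the top-ranked features. The function names, the toy data, and the feature names `feature_a` and `feature_b` are hypothetical illustrations, not the authors' MATLAB implementation or the SEER-18 variables.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Information gain H(Y) - H(Y | X) for one discrete feature X and outcome Y."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        # Outcomes of the samples that share this feature value.
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature names sorted by information gain, highest first."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data (hypothetical): two dichotomized features, four patients.
labels = ["alive", "alive", "breast cancer", "breast cancer"]
columns = {
    "feature_a": [0, 0, 1, 1],  # perfectly separates the two outcomes
    "feature_b": [0, 1, 0, 1],  # carries no information about the outcome
}

ranking = rank_features(columns, labels)
top_k = ranking[:1]  # keep only the k highest-ranked features (k = 1 here)
print(ranking, top_k)
```

In a study like the one described, the cutoff k would be chosen by retraining each classifier on the top-k features and checking that cross-validated accuracy stays close to the full-feature model; the abstract reports that 32 of 54 features were enough for that.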

