
How to use learning curves to evaluate the sample size for malaria prediction models developed using machine learning algorithms.

Authors

Zaloumis Sophie G, Rajasekhar Megha, Simpson Julie A

Affiliations

Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Carlton, VIC, Australia.

MISCH (Methods and Implementation Support for Clinical Health) Research Hub, Faculty of Medicine, Dentistry, and Health Sciences, University of Melbourne, Carlton, VIC, Australia.

Publication information

Malar J. 2025 Jul 24;24(1):242. doi: 10.1186/s12936-025-05479-3.

DOI: 10.1186/s12936-025-05479-3
PMID: 40708012
Abstract

BACKGROUND

Machine learning algorithms have been used to predict malaria risk and severity, identify immunity biomarkers for malaria vaccine candidates, and determine molecular biomarkers of antimalarial drug resistance. Developing these prediction models requires large training datasets to ensure prediction accuracy when applied to new individuals in the target population. Learning curves can be used to assess the sample size required for the training dataset by evaluating the predictive performance of a model trained using different dataset sizes. These curves are agnostic to the specific prediction model, but their construction does require existing data. This tutorial demonstrates how to generate and interpret learning curves for malaria prediction models developed using machine learning algorithms.
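
The idea of a learning curve can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the simulated Gaussian features, the nearest-centroid classifier, the plain misclassification rate, and every function name and parameter below are assumptions chosen to keep the example dependency-free. The model is refit on training sets of increasing size and scored each time on one fixed held-out test set.

```python
import random

def simulate(n, rng, shift=1.0, n_features=5, p_pos=0.29):
    """Draw n labelled samples; class 1 has every feature mean shifted by `shift`."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < p_pos else 0
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Stand-in model: the mean feature vector of each class seen in training."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=dist2)

def error_rate(centroids, X, y):
    """Fraction of samples in (X, y) that the model misclassifies."""
    return sum(predict(centroids, x) != t for x, t in zip(X, y)) / len(y)

def learning_curve(train_sizes, n_test=208, n_repeats=30, seed=0):
    """Mean test error for each training size, averaged over repeated training draws."""
    rng = random.Random(seed)
    X_test, y_test = simulate(n_test, rng)  # fixed test set, never used for training
    curve = []
    for n in train_sizes:
        errors = []
        for _ in range(n_repeats):
            X_train, y_train = simulate(n, rng)
            errors.append(error_rate(fit_centroids(X_train, y_train), X_test, y_test))
        curve.append((n, sum(errors) / n_repeats))
    return curve

sizes = [10, 20, 50, 100, 200, 400, 835]  # illustrative grid of training sizes
curve = learning_curve(sizes)
for n, err in curve:
    print(f"n_train={n:4d}  mean test error={err:.3f}")
```

Plotting the mean test error against the training size gives the learning curve. Averaging over repeated training draws smooths out the noise that comes from any single small training set.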

METHODS

To illustrate the approach, training dataset sizes were evaluated to inform the design of a "mock" prediction modelling study that aimed to predict the artemisinin resistance status of Plasmodium falciparum malaria isolates from gene expression data. Data were simulated based on a previously published in vivo parasite gene expression dataset containing the transcriptomes of 1043 P. falciparum isolates from patients with acute malaria, of which 29% (299/1043) were from slow-clearing infections (parasite clearance half-life > 5 h). Learning curves were produced for two machine learning algorithms: sparse Partial Least Squares-Discriminant Analysis plus Support Vector Machines (sPLSDA + SVMs) and random forests. Prediction error was measured using the balanced error rate, that is, the average of the percentage of slow-clearing infections incorrectly predicted as fast-clearing and the percentage of fast-clearing infections incorrectly predicted as slow-clearing.
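
The balanced error rate defined above can be computed directly from true and predicted labels. The sketch below is an illustration only: the function name, the label strings "slow" and "fast", and the toy labels are assumptions, and the result is returned as a fraction rather than a percentage.

```python
def balanced_error_rate(y_true, y_pred, positive="slow", negative="fast"):
    """Average of the two per-class error rates, as a fraction in [0, 1]."""
    pos_total = sum(1 for t in y_true if t == positive)
    neg_total = sum(1 for t in y_true if t == negative)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must be present in y_true")
    # Slow-clearing infections predicted as anything other than slow
    pos_wrong = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Fast-clearing infections predicted as anything other than fast
    neg_wrong = sum(1 for t, p in zip(y_true, y_pred) if t == negative and p != negative)
    return 0.5 * (pos_wrong / pos_total + neg_wrong / neg_total)

# Toy labels (hypothetical, not from the study): 4 slow and 6 fast infections
y_true = ["slow"] * 4 + ["fast"] * 6
y_pred = ["slow", "slow", "slow", "fast"] + ["fast"] * 3 + ["slow"] * 3

ber = balanced_error_rate(y_true, y_pred)                 # (1/4 + 3/6) / 2 = 0.375
ber_all_fast = balanced_error_rate(y_true, ["fast"] * 10)  # (4/4 + 0/6) / 2 = 0.5
print(ber, ber_all_fast)
```

Because each class contributes equally, a model that always predicts the majority class scores 0.5 however imbalanced the data are. With 29% slow-clearing infections, always predicting "fast" would misclassify only 29% of samples overall, yet its balanced error rate would still be 50%.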

RESULTS

For this mock malaria prediction study, the balanced error rate on a test dataset not used for model training (208 samples) was 50% for both sPLSDA + SVMs and random forests on the smallest training dataset evaluated (20 samples), and 14% for sPLSDA + SVMs and 22% for random forests on the largest training dataset evaluated (835 samples). The shape of the learning curves indicates that increasing the training dataset size beyond 835 samples is unlikely to significantly reduce the balanced error rates further.
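
One simple way to read a plateau off a learning curve is to find the smallest training size after which no larger size improves the error by more than a chosen tolerance. The sketch below is an assumption-laden illustration: the function name, the tolerance, and the curve values are all hypothetical and are not the study's results.

```python
def plateau_size(curve, tolerance=0.01):
    """Smallest training size whose error is within `tolerance` of the best later error.

    `curve` is a list of (training_size, error) pairs sorted by training size.
    """
    if not curve:
        raise ValueError("curve must contain at least one point")
    for i, (n, err) in enumerate(curve):
        # Best error achieved at any larger training size (or this one, if it is the last)
        best_later = min((e for _, e in curve[i + 1:]), default=err)
        if err - best_later <= tolerance:
            return n
    return curve[-1][0]  # unreachable for a non-empty curve; kept for clarity

# Hypothetical learning-curve points, for illustration only
example_curve = [(20, 0.50), (50, 0.40), (100, 0.30), (200, 0.22),
                 (400, 0.17), (600, 0.155), (835, 0.15)]

print(plateau_size(example_curve))        # 600 with the default tolerance of 0.01
print(plateau_size(example_curve, 0.05))  # a looser tolerance accepts a smaller size
```

A flat tail within the observed range is only indirect evidence about sizes larger than the largest one evaluated, so the tolerance and the conclusion drawn from it remain judgment calls.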

CONCLUSIONS

Learning curves are a simple tool for determining the minimum sample size required for future prediction modelling studies that use machine learning algorithms to predict different malaria outcomes. These curves need to be generated for each specific prediction modelling application.
