How to use learning curves to evaluate the sample size for malaria prediction models developed using machine learning algorithms.

Author information

Zaloumis Sophie G, Rajasekhar Megha, Simpson Julie A

Affiliations

Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Carlton, VIC, Australia.

MISCH (Methods and Implementation Support for Clinical Health) Research Hub, Faculty of Medicine, Dentistry, and Health Sciences, University of Melbourne, Carlton, VIC, Australia.

Publication information

Malar J. 2025 Jul 24;24(1):242. doi: 10.1186/s12936-025-05479-3.

DOI:10.1186/s12936-025-05479-3

PMID:40708012

Abstract

BACKGROUND

Machine learning algorithms have been used to predict malaria risk and severity, identify immunity biomarkers for malaria vaccine candidates, and determine molecular biomarkers of antimalarial drug resistance. Developing these prediction models requires large training datasets to ensure prediction accuracy when applied to new individuals in the target population. Learning curves can be used to assess the sample size required for the training dataset by evaluating the predictive performance of a model trained using different dataset sizes. These curves are agnostic to the specific prediction model, but their construction does require existing data. This tutorial demonstrates how to generate and interpret learning curves for malaria prediction models developed using machine learning algorithms.
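As a rough illustration of the idea, the sketch below builds a learning curve in plain Python: a fixed test set is held out, nested training subsets of increasing size are drawn, a model is refit on each subset, and the test error is recorded. Everything in it is an assumption made for illustration: the synthetic Gaussian data, the nearest-centroid classifier (a simple stand-in, not one of the algorithms used in the study), the chosen training sizes, and the helper names `make_rows`, `centroid`, `predict`, `balanced_error_rate`, and `learning_curve`.

```python
import random

def make_rows(n, label, rng, n_features=20, n_informative=5, shift=1.0):
    """Simulate n synthetic samples; informative features are shifted when label == 1."""
    return [
        [rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
         for j in range(n_features)]
        for _ in range(n)
    ]

def centroid(rows):
    """Feature-wise mean of a list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def predict(row, c_fast, c_slow):
    """Nearest-centroid rule: return 1 (slow) if closer to the slow centroid, else 0 (fast)."""
    d_fast = sum((a - b) ** 2 for a, b in zip(row, c_fast))
    d_slow = sum((a - b) ** 2 for a, b in zip(row, c_slow))
    return 1 if d_slow < d_fast else 0

def balanced_error_rate(y_true, y_pred):
    """Average of the per-class misclassification proportions."""
    rates = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        rates.append(sum(y_pred[i] != cls for i in idx) / len(idx))
    return sum(rates) / 2

def learning_curve(sizes, p_slow=0.29, seed=0):
    """Return [(training_size, test balanced error rate), ...] for nested training subsets."""
    rng = random.Random(seed)
    largest = max(sizes)
    n_slow_pool = round(p_slow * largest)
    # Training pool, kept separate by class so every subset contains both classes.
    train_slow = make_rows(n_slow_pool, 1, rng)
    train_fast = make_rows(largest - n_slow_pool, 0, rng)
    # Fixed test set that is never used for training.
    test_rows = make_rows(60, 1, rng) + make_rows(148, 0, rng)
    test_y = [1] * 60 + [0] * 148
    curve = []
    for n in sizes:
        n_slow = round(p_slow * n)
        c_slow = centroid(train_slow[:n_slow])
        c_fast = centroid(train_fast[:n - n_slow])
        preds = [predict(r, c_fast, c_slow) for r in test_rows]
        curve.append((n, balanced_error_rate(test_y, preds)))
    return curve

SIZES = [20, 50, 100, 200, 400, 835]  # illustrative training sizes
curve = learning_curve(SIZES)
for n, err in curve:
    print(f"training size {n:4d}: balanced error rate {err:.3f}")
```

In practice the subsampling would usually be repeated several times per training size and the errors averaged, so that the curve reflects sampling variability rather than a single draw.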

METHODS

To illustrate the approach, training dataset sizes were evaluated to inform the design of a "mock" prediction modelling study that aimed to predict the artemisinin resistance status of Plasmodium falciparum malaria isolates from gene expression data. Data were simulated based on a previously published in vivo parasite gene expression dataset, which contained transcriptomes of 1043 P. falciparum isolates from patients with acute malaria, of which 29% (299/1043) were from slow-clearing infections (parasite clearance half-life > 5 h). Learning curves were produced for two machine learning algorithms: sparse Partial Least Squares-Discriminant Analysis plus Support Vector Machines (sPLSDA + SVMs) and random forests. Prediction error was measured using the balanced error rate (the average of the percentage of slow-clearing infections incorrectly predicted as fast-clearing and the percentage of fast-clearing infections incorrectly predicted as slow-clearing).
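The balanced error rate can be worked through with a small example. The sketch below uses the class counts of the dataset described above (299 slow-clearing and 1043 - 299 = 744 fast-clearing isolates) and a hypothetical classifier that labels every isolate as fast-clearing; the function name `balanced_error_rate_from_counts` and the all-fast classifier are assumptions introduced for illustration, not part of the study.

```python
def balanced_error_rate_from_counts(slow_as_fast, n_slow, fast_as_slow, n_fast):
    """Average of the two per-class misclassification proportions."""
    return (slow_as_fast / n_slow + fast_as_slow / n_fast) / 2

n_slow, n_fast = 299, 744  # class counts implied by 299/1043 slow-clearing isolates

# Hypothetical classifier that predicts "fast" for every isolate:
# all slow isolates are misclassified, no fast isolate is.
ber = balanced_error_rate_from_counts(n_slow, n_slow, 0, n_fast)
overall_error = n_slow / (n_slow + n_fast)

print(f"balanced error rate: {ber:.3f}")
print(f"overall error rate:  {overall_error:.3f}")
```

With imbalanced classes, the overall error rate of such an uninformative classifier looks deceptively low (about 29%), whereas its balanced error rate is 50%, so a balanced error rate near 50% corresponds to no useful discrimination.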

RESULTS

For this mock malaria prediction study, the balanced error rate on a test dataset not used for model training (208 samples) was 50% for both sPLSDA + SVMs and random forests on the smallest training dataset evaluated (20 samples), and 14% for sPLSDA + SVMs and 22% for random forests on the largest training dataset evaluated (835 samples). The shape of the learning curves indicates that increasing the training dataset size beyond 835 samples is unlikely to significantly reduce the balanced error rates further.
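One informal way to judge whether a learning curve has flattened is to look at how much the error falls per additional 100 training samples over the last few segments of the curve. The sketch below is a heuristic under stated assumptions: the curve values are invented for illustration and are not the study's results, and the helper names `marginal_gains` and `has_plateaued`, the tolerance `tol`, and the window `last_k` are all hypothetical choices.

```python
def marginal_gains(curve):
    """Error reduction per 100 additional training samples between consecutive points.

    curve: list of (training_size, error) pairs sorted by training size,
    with error expressed as a fraction between 0 and 1.
    """
    gains = []
    for (n0, e0), (n1, e1) in zip(curve, curve[1:]):
        gains.append((n1, 100 * (e0 - e1) / (n1 - n0)))
    return gains

def has_plateaued(curve, tol=0.02, last_k=2):
    """True if each of the last_k segments reduces error by less than tol per 100 samples."""
    recent = marginal_gains(curve)[-last_k:]
    return all(gain < tol for _, gain in recent)

# Invented curve for illustration only (NOT values from the study).
example_curve = [(20, 0.48), (100, 0.30), (300, 0.20), (600, 0.17), (835, 0.165)]

for n, gain in marginal_gains(example_curve):
    print(f"up to {n:4d} samples: error falls by {gain:.4f} per 100 extra samples")
print("plateau reached:", has_plateaued(example_curve))
```

The tolerance encodes how small an improvement is considered not worth the cost of collecting more samples, so it should be chosen for each application rather than taken from this sketch.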

CONCLUSIONS

Learning curves are a simple tool for determining the minimum sample size required for future prediction modelling studies of different malaria outcomes that use machine learning algorithms. These curves need to be generated for each specific prediction modelling application.
