Wibaek Rasmus, Andersen Gregers Stig, Dahm Christina C, Witte Daniel R, Hulman Adam
Steno Diabetes Center Copenhagen, Herlev, Denmark.
Department of Public Health, Aarhus University, Aarhus, Denmark.
JMIR Med Inform. 2023 Sep 19;11:e43638. doi: 10.2196/43638.
Large language models have had a substantial impact on natural language processing (NLP) in recent years. However, their application in epidemiological research has so far been largely limited to the analysis of electronic health records and social media data.
To demonstrate the potential of NLP beyond these domains, we aimed to develop prediction models based on texts collected from an epidemiological cohort and compare their performance to classical regression methods.
We used data from the British National Child Development Study, in which 10,567 children aged 11 years wrote essays about how they imagined themselves as 25-year-olds. Overall, 15% of the data set was set aside as a test set for performance evaluation. Pretrained language models were fine-tuned using AutoTrain (Hugging Face) to predict the children's concurrent reading comprehension score (range 0-35) and their future BMI and physical activity status (active vs inactive) at the age of 33 years. We then compared their predictive performance (accuracy or discrimination) with that of linear and logistic regression models that included demographic and lifestyle factors of the parents and children from birth to the age of 11 years as predictors.
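The abstract does not report the AutoTrain configuration. As a rough illustration of the general workflow only, the sketch below fine-tunes a pretrained transformer for essay-based regression with the Hugging Face transformers Trainer API; the checkpoint (bert-base-uncased), column names, toy data, and hyperparameters are assumptions for illustration and are not the study's actual setup.

```python
# Minimal sketch, assuming a transformers Trainer workflow broadly equivalent to
# AutoTrain fine-tuning. Checkpoint, column names, and hyperparameters are illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # assumed base model; not stated in the abstract
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=1 with problem_type="regression" yields a single continuous output
# (e.g., reading comprehension score or BMI) trained with an MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, problem_type="regression")

# Toy stand-in for the essay data: free text plus a continuous label.
data = Dataset.from_dict({
    "essay": ["When I am 25 I will be a teacher ...",
              "I think I will work on a farm ...",
              "I would like to travel and play football ...",
              "I imagine living in a big city ..."],
    "label": [21.0, 17.0, 24.0, 19.0],
})
data = data.map(
    lambda batch: tokenizer(batch["essay"], truncation=True,
                            padding="max_length", max_length=512),
    batched=True,
)
split = data.train_test_split(test_size=0.15, seed=42)  # 85/15 split as in the paper

args = TrainingArguments(output_dir="essay-regressor", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=split["train"], eval_dataset=split["test"])
trainer.train()
preds = trainer.predict(split["test"]).predictions.squeeze()  # continuous predictions
```

For the binary physical activity outcome, the same pattern applies with num_labels=2 and the default classification loss instead of the regression head.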
NLP clearly outperformed linear regression when predicting reading comprehension scores (root mean square error: 3.89, 95% CI 3.74-4.05 for NLP vs 4.14, 95% CI 3.98-4.30 and 5.41, 95% CI 5.23-5.58 for regression models with and without general ability score as a predictor, respectively). Predictive performance for physical activity was similarly poor for the 2 methods (area under the receiver operating characteristic curve: 0.55, 95% CI 0.52-0.60 for both), although slightly better than chance, whereas linear regression clearly outperformed the NLP approach when predicting BMI (root mean square error: 4.38, 95% CI 4.02-4.74 for NLP vs 3.85, 95% CI 3.54-4.16 for regression). The NLP approach did not perform better than simply assigning the mean BMI of the training set as the prediction for every case.
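For context on the reported metrics, the sketch below shows one common way to compute RMSE and AUC on a held-out test set together with percentile bootstrap 95% CIs. The bootstrap procedure and the synthetic data are assumptions for illustration; the abstract does not state how the study's CIs were obtained.

```python
# Illustrative sketch: RMSE for continuous outcomes, AUC for physical activity,
# with percentile bootstrap 95% CIs on a test set. Data here are synthetic placeholders.
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

rng = np.random.default_rng(42)

def bootstrap_ci(metric, y_true, y_pred, n_boot=2000, alpha=0.05):
    """Point estimate plus percentile bootstrap CI for metric(y_true, y_pred)."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample test cases with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), lo, hi

rmse = lambda yt, yp: float(np.sqrt(mean_squared_error(yt, yp)))

# Synthetic stand-ins for test-set outcomes and model predictions.
y_score_true = rng.normal(20, 5, 200)           # reading comprehension (0-35 scale)
y_score_pred = y_score_true + rng.normal(0, 4, 200)
y_act_true = rng.integers(0, 2, 200)            # active (1) vs inactive (0)
y_act_pred = rng.uniform(0, 1, 200)             # predicted probabilities

print("RMSE (est, 95% CI):", bootstrap_ci(rmse, y_score_true, y_score_pred))
print("AUC  (est, 95% CI):", bootstrap_ci(roc_auc_score, y_act_true, y_act_pred))
```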
Our study demonstrated the potential of using large language models on text collected from epidemiological studies. The performance of the approach appeared to depend on how directly the topic of the text was related to the outcome. Open-ended questions specifically designed to capture certain health concepts and lived experiences in combination with NLP methods should receive more attention in future epidemiological studies.