Extracting circumstances of Covid-19 transmission from free text with large language models.

Author information

Bizel-Bizellot Gaston, Galmiche Simon, Lelandais Benoît, Charmet Tiffany, Coudeville Laurent, Fontanet Arnaud, Zimmer Christophe

Affiliations

Institut Pasteur, Université Paris Cité, Imaging and Modeling Unit, Paris, France.

Institut Pasteur, Université Paris Cité, Epidemiology of Emerging Diseases Unit, Paris, France.

Publication information

Nat Commun. 2025 Jul 1;16(1):5836. doi: 10.1038/s41467-025-60762-w.

Abstract

Rapidly identifying the circumstances of transmission of an emerging infectious disease is central to mitigation efforts. Here, we explore how large language models (LLMs) can automatically extract such circumstances from free-text descriptions in online surveys, in the context of Covid-19. In a nationwide study conducted online in France, we enrolled 545,958 adults with recent SARS-CoV-2 infection and inquired about the circumstances of transmission in both closed-ended and open-ended questions. First, we trained a classification model based on a pretrained LLM to predict one of seven predefined infection contexts (Work, Family, Friends, Sports, Cultural, Religious, Other) from the free text in answers to open-ended questions. We achieved an unbalanced accuracy of 75%, which increased to 91% after eliminating the 43% of responses with the highest entropy. Second, we used topic modeling to define clusters of transmission circumstances agnostically. This led to 23 clusters, which agreed with the seven predefined infection contexts, but also provided finer details on previously undefined circumstances of transmission. Our study suggests that LLM-based analysis of free text may alleviate the need for closed-ended questions in epidemiological surveys and enable insights into previously unsuspected circumstances of transmission. This approach is poised to accelerate and enrich the acquisition of epidemiological insights in future pandemics.
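
The entropy-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how entropy was computed, so this example assumes the Shannon entropy of the classifier's predicted class probabilities. The function names and the toy data are hypothetical and are not taken from the study.

```python
import math

# The seven predefined infection contexts named in the abstract.
CONTEXTS = ["Work", "Family", "Friends", "Sports", "Cultural", "Religious", "Other"]


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one predicted probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_class(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


def accuracy_after_entropy_filter(prob_rows, true_labels, drop_fraction):
    """Drop the given fraction of responses with the highest predictive
    entropy, then return (accuracy on the kept responses, number kept).

    Assumption: entropy is computed on the classifier's output probabilities.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n = len(prob_rows)
    n_drop = int(n * drop_fraction)
    # Sort response indices from most confident (low entropy) to least.
    order = sorted(range(n), key=lambda i: shannon_entropy(prob_rows[i]))
    kept = order[: n - n_drop]
    correct = sum(1 for i in kept if predicted_class(prob_rows[i]) == true_labels[i])
    return correct / len(kept), len(kept)


# Hypothetical toy data: four responses, seven class probabilities each.
toy_probs = [
    [0.94, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],  # confident "Work"
    [0.02, 0.90, 0.02, 0.02, 0.02, 0.01, 0.01],  # confident "Family"
    [0.30, 0.25, 0.20, 0.10, 0.05, 0.05, 0.05],  # uncertain, predicts "Work"
    [0.20, 0.20, 0.25, 0.10, 0.10, 0.05, 0.10],  # uncertain, predicts "Friends"
]
toy_labels = [0, 1, 2, 2]  # indices into CONTEXTS

acc_all, n_all = accuracy_after_entropy_filter(toy_probs, toy_labels, 0.0)
acc_kept, n_kept = accuracy_after_entropy_filter(toy_probs, toy_labels, 0.5)
```

On this toy data, accuracy rises from 0.75 on all four responses to 1.0 on the two lowest-entropy responses, which mirrors the trade-off between coverage and accuracy described in the abstract.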
