Automated Extraction of Mortality Information From Publicly Available Sources Using Large Language Models: Development and Evaluation Study.

Author Information

Al-Garadi Mohammed, LeNoue-Newton Michele, Matheny Michael E, McPheeters Melissa, Whitaker Jill M, Deere Jessica A, McLemore Michael F, Westerman Dax, Khan Mirza S, Hernández-Muñoz José J, Wang Xi, Kuzucan Aida, Desai Rishi J, Reeves Ruth

Affiliations

Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Avenue, Nashville, TN, 37203, United States, 1 2139151696.

Research Triangle Park, NC, United States.

Publication Information

J Med Internet Res. 2025 Aug 18;27:e71113. doi: 10.2196/71113.

Abstract

BACKGROUND

Mortality is a critical variable in health care research, especially for evaluating medical product safety and effectiveness. However, inconsistencies in the availability and timeliness of death date and cause of death (CoD) information present significant challenges. Conventional sources such as the National Death Index and electronic health records often experience data lags, missing fields, or incomplete coverage, limiting their utility in time-sensitive or large-scale studies. With the growing use of social media, crowdfunding platforms, and web-based memorials, publicly available digital content has emerged as a potential supplementary source for mortality surveillance. Despite this potential, accurate tools for extracting mortality information from such unstructured data sources remain underdeveloped.

OBJECTIVE

The aim of the study is to develop scalable approaches using natural language processing (NLP) and large language models (LLMs) for the extraction of mortality information from publicly available web-based data sources, including social media platforms, crowdfunding websites, and web-based obituaries, and to evaluate their performance across various sources.

METHODS

Data were collected from public posts on X (formerly known as Twitter), GoFundMe campaigns, memorial websites (EverLoved and TributeArchive), and web-based obituaries from 2015 to 2022, focusing on US-based content relevant to mortality. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then used a few-shot learning (FSL) approach with LLMs to identify primary and secondary CoDs. Model performance was assessed using precision, recall, F1-score, and accuracy metrics, with human-annotated labels serving as the reference standard for the transformer-based model and a human adjudicator blinded to the labeling source for the FSL model reference standard.
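The few-shot learning step described above can be sketched as a prompt-construction routine. This is a minimal illustration, assuming a generic instruction-plus-exemplars template; the exemplar texts, label format, and wording are hypothetical, not the authors' actual prompt.

```python
# Hedged sketch: building a few-shot prompt for cause-of-death (CoD)
# classification from public web text. Exemplars and template wording
# are illustrative assumptions; the paper does not publish its prompt.

FEW_SHOT_EXAMPLES = [
    ("He lost his long battle with lung cancer last week.",
     {"primary": "lung cancer", "secondary": "none"}),
    ("She passed away after a car accident led to cardiac arrest.",
     {"primary": "car accident", "secondary": "cardiac arrest"}),
]

def build_cod_prompt(post_text: str) -> str:
    """Assemble a few-shot prompt asking an LLM for primary/secondary CoD."""
    lines = ["Identify the primary and secondary cause of death in each text."]
    for text, labels in FEW_SHOT_EXAMPLES:
        lines.append(f"Text: {text}")
        lines.append(f"Primary: {labels['primary']}; Secondary: {labels['secondary']}")
    # Append the target text and leave the answer slot open for the model.
    lines.append(f"Text: {post_text}")
    lines.append("Primary:")
    return "\n".join(lines)

prompt = build_cod_prompt("A GoFundMe for a father who died of a stroke.")
```

The completed prompt would then be sent to the LLM, whose completion is parsed back into primary and secondary CoD labels.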

RESULTS

The best-performing model obtained a microaveraged F1-score of 0.88 (95% CI 0.86-0.90) in extracting mortality information. The FSL-LLM approach demonstrated high accuracy in identifying primary CoD across various web-based sources. For GoFundMe, the FSL-LLM achieved 95.9% accuracy for primary cause identification compared to 97.9% for human annotators. In obituaries, FSL-LLM accuracy was 96.5% for primary causes, while human accuracy was 99%. For memorial websites, FSL-LLM achieved 98% accuracy for primary causes, with human accuracy at 99.5%.
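The micro-averaged F1-score reported above pools true positives, false positives, and false negatives across all entity types before computing precision and recall. A minimal sketch, with illustrative counts that are not from the paper:

```python
# Micro-averaged F1: pool TP/FP/FN counts across all entity types
# (e.g., decedent name, date of birth, date of death) before computing
# precision and recall. The counts below are illustrative only.

def micro_f1(counts):
    """counts: dict mapping entity type -> (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

example_counts = {
    "name": (90, 10, 5),
    "date_of_birth": (80, 5, 15),
    "date_of_death": (85, 10, 10),
}
score = micro_f1(example_counts)  # pooled F1 over all three entity types
```

Unlike macro-averaging, this weights each extracted mention equally, so frequent entity types dominate the pooled score.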

CONCLUSIONS

This study demonstrates the feasibility of using advanced NLP and LLM techniques to extract mortality data from publicly available web-based sources. These methods can significantly enhance the timeliness, completeness, and granularity of mortality surveillance, offering a valuable complement to traditional data systems. By enabling earlier detection of mortality signals and improving CoD classification across large populations, this approach may support more responsive public health monitoring and medical product safety assessments. Further work is needed to validate these findings in real-world health care settings and facilitate the integration of digital data sources into national public health surveillance systems.

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f836/12359966/822b482faddc/jmir-v27-e71113-g001.jpg
