
The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: Content Analysis.

Authors

Wals Zurita Amadeo Jesus, Miras Del Rio Hector, Ugarte Ruiz de Aguirre Nerea, Nebrera Navarro Cristina, Rubio Jimenez Maria, Muñoz Carmona David, Miguez Sanchez Carlos

Affiliations

Servicio Oncologia Radioterápica, Hospital Universitario Virgen Macarena, Andalusian Health Service, Seville, Spain.

Publication

JMIR Med Inform. 2025 Jan 2;13:e58457. doi: 10.2196/58457.

Abstract

BACKGROUND

In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records.

OBJECTIVE

We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.

METHODS

We implemented a script using the OpenAI application programming interface to extract structured information, in JavaScript Object Notation (JSON) format, on the comorbidities reported in 250 personal history reports. These reports were manually reviewed, in batches of 50, by 5 specialists in radiation oncology. We compared the results using metrics such as sensitivity, specificity, precision, accuracy, F-value, κ index, and the McNemar test, and also examined the common causes of errors in both the humans and the generative pretrained transformer (GPT) models.
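As a concrete illustration of this extraction step, a minimal sketch using the OpenAI Python SDK (v1.x) follows; it relies on the JSON mode supported by the gpt-3.5-turbo-1106 and gpt-4-1106-preview models. The prompt wording, the output schema, and the extract_comorbidities helper are illustrative assumptions, not the authors' actual script.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical instruction; JSON mode requires the word "JSON" in the prompt.
    SYSTEM_PROMPT = (
        "You extract comorbidities from clinical personal history reports. "
        'Reply with a JSON object of the form {"comorbidities": ["condition", ...]}.'
    )

    def extract_comorbidities(report_text, model="gpt-4-1106-preview"):
        """Return the model's list of comorbidities for one free-text report."""
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # favor reproducible output across repeated runs
            response_format={"type": "json_object"},  # JSON mode
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": report_text},
            ],
        )
        return json.loads(response.choices[0].message.content)["comorbidities"]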
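The statistical comparison can be sketched in the same spirit. Assuming binary presence/absence vectors per comorbidity scored against a consensus gold standard (y_true, y_model, and y_human are assumed names, not the authors'), the listed metrics map onto standard scikit-learn and statsmodels calls:

    import numpy as np
    from sklearn.metrics import confusion_matrix, cohen_kappa_score, f1_score
    from statsmodels.stats.contingency_tables import mcnemar

    def summarize(y_true, y_pred):
        """Per-rater agreement metrics against the gold standard."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "f_value": f1_score(y_true, y_pred),
            "kappa": cohen_kappa_score(y_true, y_pred),
        }

    def mcnemar_model_vs_human(y_true, y_model, y_human):
        """Paired test on whether the two raters err on the same items."""
        model_ok = np.asarray(y_model) == np.asarray(y_true)
        human_ok = np.asarray(y_human) == np.asarray(y_true)
        table = [
            [np.sum(model_ok & human_ok), np.sum(model_ok & ~human_ok)],
            [np.sum(~model_ok & human_ok), np.sum(~model_ok & ~human_ok)],
        ]
        return mcnemar(table, exact=True).pvalue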

RESULTS

The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred nonexplicit comorbidities, sometimes correctly, though this also resulted in more false positives.
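The consistency figure reported here can be measured by repeating the extraction and counting reports whose runs all return the identical comorbidity set; a short sketch building on the hypothetical extract_comorbidities above:

    def consistency_rate(reports, runs=10):
        """Fraction of reports whose repeated analyses agree exactly."""
        consistent = 0
        for report in reports:
            outputs = {frozenset(extract_comorbidities(report)) for _ in range(runs)}
            consistent += (len(outputs) == 1)
        return consistent / len(reports)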

CONCLUSIONS

This study demonstrates that, with well-designed prompts, the large language models examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior time and cost efficiency, together with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3a7/11739723/6cfee737949c/medinform_v13i1e58457_fig1.jpg
