Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, New York.
Department of Epidemiology, Columbia University Mailman School of Public Health, New York.
JAMA Netw Open. 2024 Aug 1;7(8):e2425981. doi: 10.1001/jamanetworkopen.2024.25981.
IMPORTANCE: Large language models (LLMs) have the potential to increase the efficiency of information extraction from unstructured clinical notes in electronic medical records.
OBJECTIVE: To assess the utility and reliability of an LLM, ChatGPT-4 (OpenAI), for analyzing clinical narratives and identifying the helmet use status of patients injured in micromobility-related accidents.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study used publicly available, deidentified 2019 to 2022 data from the US Consumer Product Safety Commission's National Electronic Injury Surveillance System, a nationally representative stratified probability sample of 96 hospitals in the US. Unweighted estimates of e-bike-, bicycle-, hoverboard-, and powered scooter-related injuries that resulted in an emergency department visit were used. Statistical analysis was performed from November 2023 to April 2024.
MAIN OUTCOMES AND MEASURES: Patient helmet status (wearing vs not wearing vs unknown) was extracted from clinical narratives using (1) a text string search using researcher-generated text strings and (2) the LLM by prompting the system with low-, intermediate-, and high-detail prompts. The level of agreement between the 2 approaches across all 3 prompts was analyzed using Cohen κ test statistics. Fleiss κ was calculated to measure the test-retest reliability of the high-detail prompt across 5 new chat sessions and days. Performance statistics were calculated by comparing results from the high-detail prompt to classifications of helmet status generated by researchers reading the clinical notes (ie, a criterion standard review).
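As a rough illustration of how the two extraction approaches can be compared on a set of narratives, the Python sketch below pairs a keyword search with an LLM prompt and computes Cohen κ between their labels. The keyword patterns, prompt wording, and model identifier are placeholders for illustration, not the study's actual search strings or prompts.

```python
# Minimal sketch (not the authors' code): two extraction approaches and their
# agreement. Keyword patterns, prompt wording, and the model name are
# illustrative assumptions.
import re

from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

# Illustrative researcher-style text strings (approach 1).
NOT_WEARING = re.compile(r"unhelmeted|\b(?:no|not|without)\b[^.]{0,30}\bhelmet",
                         re.IGNORECASE)
WEARING = re.compile(r"\bhelmet(?:ed)?\b", re.IGNORECASE)

def string_search_status(narrative: str) -> str:
    """Classify helmet status with simple text-string rules."""
    if NOT_WEARING.search(narrative):
        return "not wearing"
    if WEARING.search(narrative):
        return "wearing"
    return "unknown"

# Stand-in for a "high-detail" prompt (approach 2); the study's prompts differed.
HIGH_DETAIL_PROMPT = (
    "You are reviewing an emergency department narrative describing a "
    "micromobility-related injury. Classify the patient's helmet status as "
    "exactly one of: wearing, not wearing, unknown. Reply with the label only.\n\n"
    "Narrative: {narrative}"
)

def llm_status(narrative: str, client: OpenAI) -> str:
    """Ask the LLM for a helmet-status label."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user",
                   "content": HIGH_DETAIL_PROMPT.format(narrative=narrative)}],
    )
    return response.choices[0].message.content.strip().lower()

def agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen kappa between the two sets of labels for the same notes."""
    return cohen_kappa_score(labels_a, labels_b)
```

Agreement would be computed over the labels returned by both functions for the same set of notes, once per prompt level.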
RESULTS: Among 54 569 clinical notes, moderate (Cohen κ = 0.74 [95% CI, 0.73-0.75]) and weak (Cohen κ = 0.53 [95% CI, 0.52-0.54]) agreement were found between the text string-search approach and the LLM for the low- and intermediate-detail prompts, respectively. The high-detail prompt had almost perfect agreement (κ = 1.00 [95% CI, 1.00-1.00]) but required the greatest amount of time to complete. The LLM did not perfectly replicate its analyses across new sessions and days (Fleiss κ = 0.91 across 5 trials; P < .001). The LLM often hallucinated and was consistent in replicating its hallucinations. It also showed high validity compared with the criterion standard (n = 400; κ = 0.98 [95% CI, 0.96-1.00]).
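The reliability and validity checks reported above can be sketched in the same spirit: Fleiss κ treats each repeated chat session as a rater, and validity is Cohen κ against the researchers' criterion-standard labels. The array shapes and variable names below are assumptions for illustration.

```python
# Minimal sketch of the reliability and validity checks; array shapes and
# variable names are assumptions, not the study's code.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def test_retest_fleiss_kappa(runs: np.ndarray) -> float:
    """Fleiss kappa across repeated sessions.

    `runs` is an (n_notes, n_sessions) array of labels such as
    "wearing" / "not wearing" / "unknown"; each session is treated as a rater.
    """
    table, _ = aggregate_raters(runs)  # per-note counts in each category
    return fleiss_kappa(table, method="fleiss")

def validity_vs_criterion(llm_labels: list[str], gold_labels: list[str]) -> float:
    """Cohen kappa of the LLM's labels against the researcher review."""
    return cohen_kappa_score(llm_labels, gold_labels)
```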
CONCLUSIONS AND RELEVANCE: This study's findings suggest that although there are efficiency gains from using the LLM to extract information from clinical notes, its inadequate reliability compared with a text string-search approach, hallucinations, and inconsistent performance significantly hinder the potential of the currently available LLM.