Klang Eyal, Tessler Idit, Apakama Donald U, Abbott Ethan, Glicksberg Benjamin S, Arnold Monique, Moses Akini, Sakhuja Ankit, Soroush Ali, Charney Alexander W, Reich David L, McGreevy Jolion, Gavin Nicholas, Carr Brendan, Freeman Robert, Nadkarni Girish N
medRxiv. 2024 Oct 17:2024.10.15.24315526. doi: 10.1101/2024.10.15.24315526.
Accurate medical coding is essential for clinical and administrative purposes, but the process is complex, time-consuming, and prone to bias. This study compares ICD-10-CM codes generated from emergency department (ED) clinical records by Retrieval-Augmented Generation (RAG)-enhanced large language models (LLMs) with the codes assigned by the original providers.
Retrospective cohort study of 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated data from 1,038,066 past ED visits (2021-2023) into the LLMs' predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM and provider-assigned codes on accuracy and specificity.
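The abstract does not specify how past visits were retrieved or how they were folded into the prompt, so the following is only a minimal sketch of a retrieval-augmented coding step, assuming TF-IDF similarity search over historical ED notes and a simple few-shot prompt; the note texts, codes, and prompt wording are illustrative, not study data.

```python
# Minimal sketch of a RAG step for ICD-10-CM coding (assumed design, not the
# authors' implementation): retrieve similar past ED visits and include their
# provider-assigned codes as context for the LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical historical corpus of (note text, provider-assigned codes).
past_visits = [
    ("Chest pain radiating to left arm, troponin elevated.", ["I21.4"]),
    ("Fever and productive cough, infiltrate on chest x-ray.", ["J18.9"]),
    ("Fall from standing, wrist deformity, fracture confirmed.", ["S52.501A"]),
]

vectorizer = TfidfVectorizer()
corpus_matrix = vectorizer.fit_transform([note for note, _ in past_visits])

def build_rag_prompt(current_note: str, k: int = 2) -> str:
    """Retrieve the k most similar past visits and fold their codes into the prompt."""
    query_vec = vectorizer.transform([current_note])
    scores = cosine_similarity(query_vec, corpus_matrix)[0]
    top_idx = scores.argsort()[::-1][:k]
    examples = "\n".join(
        f"- Note: {past_visits[i][0]} Codes: {', '.join(past_visits[i][1])}"
        for i in top_idx
    )
    return (
        "Assign ICD-10-CM codes to the ED note below.\n"
        f"Similar past visits and their provider-assigned codes:\n{examples}\n"
        f"Current note: {current_note}\nCodes:"
    )

print(build_rag_prompt("Crushing chest pain with elevated troponin."))
```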
RAG-enhanced LLMs demonstrated superior performance to provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where discrepancies existed between GPT-4 and provider-assigned codes, human reviewers favored GPT-4 for accuracy in 447 instances, compared to 277 instances where providers' codes were preferred (p<0.001). Similarly, GPT-4 was selected for its superior specificity in 509 cases, whereas human coders were preferred in only 181 cases (p<0.001). Smaller open-access models, such as Llama-3.1-70B, also demonstrated substantial scalability when enhanced with RAG, with 218 instances of accuracy preference compared to 90 for providers' codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes significantly improved following RAG integration, with Qwen-2-7B increasing from 0.8% to 17.6% and Gemma-2-9b-it improving from 7.2% to 26.4%.
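The abstract reports exact match rates but does not define the metric precisely; a plausible reading, sketched below under the assumption that "exact match" means the LLM's code set for a visit equals the provider's code set, is a simple per-visit fraction (the example values are toy data, not study results).

```python
# Illustrative exact-match-rate computation between LLM-generated and
# provider-assigned ICD-10-CM code sets (assumed definition: set equality per visit).
def exact_match_rate(llm_codes: list[set[str]], provider_codes: list[set[str]]) -> float:
    """Fraction of visits where the LLM's code set equals the provider's code set."""
    matches = sum(llm == prov for llm, prov in zip(llm_codes, provider_codes))
    return matches / len(provider_codes)

# Toy example with three visits (not study data).
llm = [{"I21.4"}, {"J18.9", "R05.9"}, {"S52.501A"}]
prov = [{"I21.4"}, {"J18.9"}, {"S52.501A"}]
print(f"{exact_match_rate(llm, prov):.1%}")  # 66.7%
```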
RAG-enhanced LLMs improve medical coding accuracy in the ED, supporting their integration into clinical workflows. These findings suggest that generative AI can improve clinical outcomes and reduce administrative burden.
This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
A study showed that AI models with retrieval-augmented generation outperformed human providers in ED diagnostic coding accuracy and specificity. Even smaller AI models performed favorably when using RAG. This suggests potential for reducing administrative burden in healthcare, improving coding efficiency, and enhancing clinical documentation.