Hou Zhen, Liu Hao, Bian Jiang, He Xing, Zhuang Yan
Department of Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN USA.
School of Computing, College of Science and Mathematics, Montclair State University, Montclair, NJ USA.
Npj Health Syst. 2025;2(1):14. doi: 10.1038/s44401-025-00018-3. Epub 2025 May 1.
Medical coding is essential for healthcare operations yet remains predominantly manual, error-prone (error rates of up to 20%), and costly (up to $18.2 billion annually). Although large language models (LLMs) have shown promise in natural language processing, their application to medical coding has achieved only limited accuracy. In this study, we evaluated whether fine-tuning LLMs with specialized ICD-10 knowledge can automate code generation from clinical documentation. We adopted a two-phase approach: initial fine-tuning on 74,260 ICD-10 code-description pairs, followed by enhanced training to address linguistic and lexical variations. Evaluations using a proprietary model (GPT-4o mini) on a cloud platform and an open-source model (Llama) on local GPUs showed that initial fine-tuning increased exact-match accuracy from less than 1% to 97%, while enhanced fine-tuning further improved performance in complex scenarios, with real-world clinical notes achieving 69.20% exact match and 87.16% category match. These findings indicate that domain-specific fine-tuned LLMs can reduce manual coding burden and improve reliability.
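A minimal sketch, not the authors' pipeline, of the two ingredients the abstract describes: serializing ICD-10 code-description pairs into chat-style JSONL records for supervised fine-tuning, and scoring predictions by exact match (full code) versus category match (three-character ICD-10 category). The JSONL layout, system prompt, and helper names are illustrative assumptions.

```python
# Hypothetical sketch: format ICD-10 code-description pairs for supervised
# fine-tuning and score predictions by exact vs. category match.
# The field names and prompt text are assumptions, not the study's schema.
import json


def to_finetune_example(description: str, code: str) -> dict:
    """Wrap one code-description pair in a chat-style training record."""
    return {
        "messages": [
            {"role": "system", "content": "Return the ICD-10-CM code for the diagnosis."},
            {"role": "user", "content": description},
            {"role": "assistant", "content": code},
        ]
    }


def write_jsonl(pairs, path="icd10_finetune.jsonl"):
    """Write (description, code) pairs as one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for description, code in pairs:
            f.write(json.dumps(to_finetune_example(description, code)) + "\n")


def exact_match(pred: str, gold: str) -> bool:
    """Full-code agreement, e.g. 'E11.9' vs 'E11.9'."""
    return pred.strip().upper() == gold.strip().upper()


def category_match(pred: str, gold: str) -> bool:
    """Agreement on the three-character category, e.g. 'E11.65' and 'E11.9' share 'E11'."""
    return pred.strip().upper()[:3] == gold.strip().upper()[:3]


if __name__ == "__main__":
    pairs = [("Type 2 diabetes mellitus without complications", "E11.9")]
    write_jsonl(pairs)
    print(exact_match("E11.9", "E11.9"))      # True
    print(category_match("E11.65", "E11.9"))  # True: same category, not an exact match
```

Under this reading, the reported 69.20% exact match and 87.16% category match differ precisely because category match credits predictions that land in the correct three-character block but miss the full code.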