Large language models vs human for classifying clinical documents.

Author information

Mustafa Akram, Naseem Usman, Rahimi Azghadi Mostafa

Affiliations

College of Science and Engineering, James Cook University, Townsville, 4811, QLD, Australia.

School of Computing, Macquarie University, Sydney, 2113, NSW, Australia.

Publication information

Int J Med Inform. 2025 Mar;195:105800. doi: 10.1016/j.ijmedinf.2025.105800. Epub 2025 Jan 21.

DOI:10.1016/j.ijmedinf.2025.105800

PMID:39848078

Abstract

BACKGROUND

Accurate classification of medical records is crucial for clinical documentation, particularly when using the 10th revision of the International Classification of Diseases (ICD-10) coding system. The use of machine learning algorithms and Systematized Nomenclature of Medicine (SNOMED) mapping has shown promise in performing these classifications. However, challenges remain, particularly in reducing false negatives, where certain diagnoses are not correctly identified by either approach.

OBJECTIVE

This study explores the potential of leveraging advanced large language models to improve the accuracy of ICD-10 classifications in challenging cases of medical records where machine learning and SNOMED mapping fail.

METHODS

We evaluated the performance of ChatGPT 3.5 and ChatGPT 4 in assigning ICD-10 codes to discharge summaries from selected records of the Medical Information Mart for Intensive Care (MIMIC) IV dataset. The selection comprised 802 discharge summaries that both the machine learning and the SNOMED mapping methods had classified as false negatives, which marks them as challenging cases. Each summary was assessed by ChatGPT 3.5 and ChatGPT 4 using a classification prompt, and the results were compared to human coder evaluations. Five human coders, with a combined experience of over 30 years, independently classified a stratified sample of 100 summaries to validate ChatGPT's performance.
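The abstract does not say how the stratified sample of 100 summaries was drawn. The following is a minimal sketch of one common approach, assuming the strata are the ICD-10 codes under evaluation and that allocation is proportional to stratum size. The function name `stratified_sample`, the `(summary_id, icd10_code)` record layout, and the largest-remainder rounding are all illustrative assumptions, not the authors' procedure.

```python
import random
from collections import defaultdict


def stratified_sample(records, n, seed=0):
    """Draw n records, allocating draws to each stratum proportionally.

    records: list of (summary_id, icd10_code) pairs (hypothetical layout).
    The stratum of a record is its ICD-10 code (an assumption).
    """
    if n > len(records):
        raise ValueError("sample size exceeds number of records")

    rng = random.Random(seed)

    # Group records by stratum.
    strata = defaultdict(list)
    for rec in records:
        strata[rec[1]].append(rec)

    # Ideal (fractional) share of the sample for each stratum.
    total = len(records)
    quotas = {code: n * len(items) / total for code, items in strata.items()}

    # Largest-remainder rounding so the allocations sum to exactly n.
    alloc = {code: int(q) for code, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for code in by_remainder[:leftover]:
        alloc[code] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for code, items in strata.items():
        sample.extend(rng.sample(items, alloc[code]))
    return sample
```

Fixing the seed makes the draw reproducible, which matters when the same 100 summaries have to be handed to several coders and to the model.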

RESULTS

ChatGPT 4 demonstrated significantly improved consistency over ChatGPT 3.5, with matching results between runs ranging from 86% to 89%, compared to 57% to 67% for ChatGPT 3.5. The classification accuracy of ChatGPT 4 was variable across different ICD-10 codes. Overall, human coders performed better than ChatGPT. However, ChatGPT matched the median performance of human coders, achieving an accuracy rate of 22%.
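The abstract reports run-to-run matching rates and an accuracy figure compared with the median human coder, but not the exact formulas. As a hedged illustration, the sketch below assumes that consistency is the share of summaries receiving the same label in two runs, and that accuracy is the share of labels agreeing with a reference. The function names and the label lists are hypothetical.

```python
from statistics import median


def agreement(run_a, run_b):
    """Share of summaries on which two runs returned the same label."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same summaries")
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / len(run_a)


def accuracy(predicted, reference):
    """Share of predicted labels that equal the reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def median_coder_accuracy(coder_labels, reference):
    """Median of the per-coder accuracies (one label list per coder)."""
    return median(accuracy(labels, reference) for labels in coder_labels)
```

Under these assumptions, a model "matches the median human coder" when `accuracy(model_labels, reference)` equals `median_coder_accuracy(coder_labels, reference)` on the same sample.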

CONCLUSION

This study underscores the potential of integrating advanced language models with clinical coding processes to improve documentation accuracy. ChatGPT 4 demonstrated improved consistency and performance comparable to that of the median human coder, achieving 22% accuracy in challenging cases. Combining ChatGPT with methods like SNOMED mapping could further enhance clinical coding accuracy, particularly for complex scenarios.
