Far Aryana T, Bastani Asal, Lee Albert, Gologorskaya Oksana, Huang Chiung-Yu, Pletcher Mark J, Lai Jennifer C, Ge Jin
Department of Medicine, Division of Gastroenterology and Hepatology, University of California-San Francisco, San Francisco, California, USA.
Academic Research Services, University of California-San Francisco, San Francisco, California, USA.
Hepatology. 2025 Jun 1;81(6):1753-1763. doi: 10.1097/HEP.0000000000001115. Epub 2024 Oct 8.
Diagnosis code classification is a common method for cohort identification in cirrhosis research, but it is often inaccurate and has to be supplemented by labor-intensive chart review. Natural language processing using large language models (LLMs) is a potentially more accurate method. To assess LLMs' potential for cirrhosis cohort identification, we compared code-based versus LLM-based classification with chart review as a "gold standard."
We extracted and conducted a limited chart review of 3788 discharge summaries of cirrhosis admissions. We engineered zero-shot prompts for Generative Pre-trained Transformer 4 (GPT-4) to determine whether cirrhosis and its complications were active hospitalization problems. We calculated positive predictive values (PPVs) of LLM-based classification versus limited chart review, and PPVs of code-based classification versus LLM-based classification, used as a "silver standard," in all 3788 summaries. Compared to gold standard chart review, code-based classification achieved PPVs of 82.2% for identifying cirrhosis, 41.7% for hepatic encephalopathy (HE), 72.8% for ascites, 59.8% for gastrointestinal bleeding, and 48.8% for spontaneous bacterial peritonitis. Compared to chart review, GPT-4 achieved accuracies of 87.8%-98.8% for identifying cirrhosis and its complications. Using the LLM as a silver standard, code-based classification achieved PPVs of 79.8% for identifying cirrhosis, 53.9% for HE, 55.3% for ascites, 67.6% for gastrointestinal bleeding, and 65.5% for spontaneous bacterial peritonitis.
LLM-based classification was highly accurate versus manual chart review in identifying cirrhosis and its complications. This allowed us to assess the performance of code-based classification at scale using LLMs as a silver standard. These results suggest LLMs could augment or replace code-based cohort classification and raise questions regarding the necessity of chart review.
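The PPV and accuracy figures reported above follow from simple counts of agreement between a classifier and a reference standard. The sketch below is illustrative only and is not the authors' analysis code: the function names and the toy labels are assumptions, and the reference labels stand in for either chart review (gold standard) or LLM output (silver standard).

```python
from typing import Sequence


def ppv(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Positive predictive value: of the cases the classifier flagged
    as positive, the fraction the reference standard also calls positive."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    true_pos = sum(1 for p, r in zip(predicted, reference) if p and r)
    flagged = sum(1 for p in predicted if p)
    if flagged == 0:
        raise ValueError("PPV is undefined when nothing is flagged positive")
    return true_pos / flagged


def accuracy(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of all cases where the classifier and the reference agree."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("accuracy is undefined for an empty sample")
    agree = sum(1 for p, r in zip(predicted, reference) if p == r)
    return agree / len(predicted)


# Toy data (hypothetical, not from the study): five admissions.
code_flags = [True, True, True, False, True]      # diagnosis-code classification
chart_review = [True, False, True, False, True]   # reference standard labels

print(f"PPV: {ppv(code_flags, chart_review):.2f}")
print(f"Accuracy: {accuracy(code_flags, chart_review):.2f}")
```

In this toy sample, four admissions are flagged by code and three of those are confirmed by the reference, so the PPV is 3/4. Swapping the reference labels from chart review to LLM output gives the silver standard comparison without changing the calculation.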