Automated Pathologic TN Classification Prediction and Rationale Generation From Lung Cancer Surgical Pathology Reports Using a Large Language Model Fine-Tuned With Chain-of-Thought: Algorithm Development and Validation Study.

Author information

Kim Sanghwan, Jang Sowon, Kim Borham, Sunwoo Leonard, Kim Seok, Chung Jin-Haeng, Nam Sejin, Cho Hyeongmin, Lee Donghyoung, Lee Keehyuck, Yoo Sooyoung

Affiliations

ezCaretech Research & Development Center, Seoul, Republic of Korea.

Department of Radiology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea.

Publication information

JMIR Med Inform. 2024 Dec 20;12:e67056. doi: 10.2196/67056.

DOI:10.2196/67056

PMID:39705675

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11699504/

Abstract

BACKGROUND

Traditional rule-based natural language processing approaches in electronic health record systems are effective but are often time-consuming and prone to errors when handling unstructured data. This is primarily due to the substantial manual effort required to parse and extract information from diverse types of documentation. Recent advancements in large language model (LLM) technology have made it possible to automatically interpret medical context and support pathologic staging. However, existing LLMs encounter challenges in rapidly adapting to specialized guideline updates. In this study, we fine-tuned an LLM specifically for lung cancer pathologic staging, enabling it to incorporate the latest guidelines for pathologic TN classification.

OBJECTIVE

This study aims to evaluate the performance of fine-tuned generative language models in automatically inferring pathologic TN classifications and extracting their rationale from lung cancer surgical pathology reports. By addressing the inefficiencies and extensive parsing efforts associated with rule-based methods, this approach seeks to enable rapid and accurate reclassification aligned with the latest cancer staging guidelines.

METHODS

We conducted a comparative performance evaluation of 6 open-source LLMs for automated TN classification and rationale generation, using 3216 deidentified lung cancer surgical pathology reports collected from a tertiary hospital and based on the American Joint Committee on Cancer (AJCC) Cancer Staging Manual, 8th edition. The dataset was preprocessed by segmenting each report according to lesion location and morphological diagnosis. Performance was assessed using two evaluation metrics: the exact match ratio (EMR), which measures classification accuracy, and the semantic match ratio (SMR), which measures the contextual alignment of the generated rationales.
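
The abstract does not spell out how EMR is computed. As a minimal sketch, assuming EMR is the fraction of reports for which both the predicted T and N categories exactly match the reference labels, it could be calculated as shown here. The names `normalize` and `exact_match_ratio`, the label normalization, and the sample labels are all hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple

# A (T category, N category) pair, e.g. ("T1b", "N0"). Hypothetical representation.
TN = Tuple[str, str]


def normalize(label: str) -> str:
    """Lowercase, drop spaces, and drop an optional leading 'p' (pathologic prefix)."""
    cleaned = label.strip().lower().replace(" ", "")
    return cleaned[1:] if cleaned.startswith("p") else cleaned


def exact_match_ratio(predicted: Sequence[TN], reference: Sequence[TN]) -> float:
    """Fraction of reports where both T and N match the reference exactly.

    Assumption: a report counts as a match only if T and N are both correct.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one report is required")
    matches = sum(
        all(normalize(p) == normalize(r) for p, r in zip(pred, ref))
        for pred, ref in zip(predicted, reference)
    )
    return matches / len(reference)


# Illustrative labels only; these are not data from the study.
predicted = [("T1b", "N0"), ("pT2a", "N1"), ("T3", "N0"), ("T1a", "N2")]
reference = [("T1b", "N0"), ("T2a", "N1"), ("T2b", "N0"), ("T1a", "N2")]
emr = exact_match_ratio(predicted, reference)
```

In this toy example, three of the four reports match on both T and N, so `emr` is 0.75. SMR is not sketched because the abstract gives no indication of how semantic alignment between generated and reference rationales was scored.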

RESULTS

Among the 6 models, the Orca2_13b model achieved the highest performance, with an EMR of 0.934 and an SMR of 0.864. The Orca2_7b model also performed strongly, with an EMR of 0.914 and an SMR of 0.854. The Llama2 models ranked next: Llama2_7b achieved an EMR of 0.864 and an SMR of 0.771, and Llama2_13b achieved an EMR of 0.762 and an SMR of 0.690. The Mistral_7b and Llama3_8b models showed the lowest performance, with EMRs of 0.572 and 0.489 and SMRs of 0.377 and 0.456, respectively. Overall, the Orca2 models outperformed all other models in both TN classification and rationale generation.

CONCLUSIONS

The generative language model approach presented in this study has the potential to enhance and automate TN classification in complex cancer staging, supporting both clinical practice and oncology data curation. With additional fine-tuning based on cancer-specific guidelines, this approach can be effectively adapted to other cancer types.
