Language Models for Multilabel Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study.

Author information

Balch Jeremy A, Desaraju Sasank S, Nolan Victoria J, Vellanki Divya, Buchanan Timothy R, Brinkley Lindsey M, Penev Yordan, Bilgili Ahmet, Patel Aashay, Chatham Corinne E, Vanderbilt David M, Uddin Rayon, Bihorac Azra, Efron Philip, Loftus Tyler J, Rahman Protiva, Shickel Benjamin

Affiliations

Department of Surgery, University of Florida College of Medicine, Gainesville, FL, United States.

Department of Health Outcomes and Biomedical Informatics, University of Florida College of Medicine, Gainesville, FL, United States.

Publication information

JMIR Med Inform. 2025 Jul 9;13:e71176. doi: 10.2196/71176.

Abstract

BACKGROUND

Operative notes are frequently mined for surgical concepts in clinical care, research, quality improvement, and billing, often requiring hours of manual extraction. These notes are typically analyzed at the document level to determine the presence or absence of specific procedures or findings (eg, whether a hand-sewn anastomosis was performed or contamination occurred). Extracting several binary classification labels simultaneously is a multilabel classification problem. Traditional natural language processing approaches, namely bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) with linear classifiers, have been used previously for this task but are now being augmented or replaced by large language models (LLMs). However, few studies have examined the utility of LLMs in surgery.

OBJECTIVE

We developed and evaluated LLMs for the purpose of expediting data extraction from surgical notes.

METHODS

A total of 388 exploratory laparotomy notes from a single institution were annotated for 21 concepts related to intraoperative findings, intraoperative techniques, and closure techniques. Annotation consistency was measured using the Cohen κ statistic. Data were preprocessed to include only the description of the procedure. We compared successive generations of document classification technology, from BoW and tf-idf to encoder-only (Clinical-Longformer) and decoder-only (Llama 3) transformer models. Multilabel classification performance was evaluated using 5-fold cross-validation, with the F1-score and Hamming loss (HL) as metrics. We experimented with and without context. Errors were assessed by manual review. Code and implementation instructions may be found on GitHub.
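To make the two evaluation metrics concrete, the following is a minimal Python sketch of micro-averaged F1 and Hamming loss for multilabel predictions. It is not the authors' code (theirs is referenced on GitHub); the function names and the toy label matrices are hypothetical, chosen only for illustration. Each row stands for one note and each column for one binary concept.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]  # rows = notes, columns = binary concepts


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every (note, concept) pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of (note, concept) pairs whose predicted label is wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Toy data (invented): 3 notes, 4 concepts.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]]

print(round(micro_f1(y_true, y_pred), 3))      # 5 TP, 1 FP, 1 FN -> 0.833
print(round(hamming_loss(y_true, y_pred), 3))  # 2 wrong of 12 -> 0.167
```

Because micro-averaging pools counts across all concepts, frequent labels dominate the score, whereas Hamming loss also rewards correct negatives. That is one reason two models can have similar HL but different micro F1-scores when most labels are rare.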

RESULTS

The prevalence of labels ranged from 0.05 (colostomy, ileostomy, active bleed from named vessel) to 0.50 (running fascial closure). Llama 3.3 was the overall best-performing model (micro F1-score 0.88, 5-fold range: 0.88-0.89; HL 0.11, 5-fold range: 0.11-0.12). The BoW model (micro F1-score 0.68, 5-fold range: 0.64-0.71; HL 0.14, 5-fold range: 0.13-0.16) and Clinical-Longformer (micro F1-score 0.73, 5-fold range: 0.70-0.74; HL 0.11, 5-fold range: 0.10-0.12) performed similarly overall, with tf-idf models trailing (micro F1-score 0.57, 5-fold range: 0.55-0.59; HL 0.27, 5-fold range: 0.25-0.29). F1-scores varied across concepts in the Llama model, ranging from 0.30 (5-fold range: 0.23-0.39) for class III contamination to 0.92 (5-fold range: 0.84-0.98) for bowel resection. Context improved Llama's performance, raising F1-scores by an average of 0.16. Error analysis revealed semantic nuances and edge cases within operative notes, particularly when notes referenced prior operations or simultaneous operations with other surgical services.

CONCLUSIONS

Off-the-shelf autoregressive LLMs outperformed fine-tuned, encoder-only transformers and traditional natural language processing techniques in classifying operative notes. Multilabel classification with LLMs may streamline retrospective reviews in surgery, though further refinements are required prior to reliable use in research and quality improvement.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe1c/12266303/d5695d98993b/medinform-v13-e71176-g001.jpg
