Harel-Canada Fabrice, Salimian Anabel, Moghanian Brandon, Clingan Sarah, Nguyen Allan, Avra Tucker, Poimboeuf Michelle, Romero Ruby, Funnell Arthur, Petousis Panayiotis, Shin Michael, Peng Nanyun, Shover Chelsea L, Goodman-Meza David
Computer Science Department, University of California, Los Angeles, 404 Westwood Plaza Suite 277, Los Angeles, 90095, CA, USA.
Semel Institute for Neuroscience and Human Behavior at University of California, Los Angeles, 760 Westwood Plaza, Los Angeles, 90024, CA, USA.
Res Sq. 2025 May 15:rs.3.rs-6615981. doi: 10.21203/rs.3.rs-6615981/v1.
Identifying substance use behaviors in electronic health records (EHRs) is challenging because critical details are often buried in unstructured notes that use varied terminology and negation, requiring careful contextual interpretation to distinguish current, relevant use from historical mentions or denials. Using MIMIC-III/IV discharge summaries, we created a large, annotated drug detection dataset to tackle this problem and support future systematic substance use surveillance. We then investigated the performance of multiple large language models (LLMs) for detecting eight substance use categories within these data. Evaluating models in zero-shot, few-shot, and fine-tuning configurations, we found that a fine-tuned model, Llama-DrugDetector-70B, outperformed the others. It achieved near-perfect F1-scores (≥ 0.95) for most individual substances and strong scores on more complex tasks such as prescription opioid misuse (F1 = 0.815) and polysubstance use (F1 = 0.917). These findings demonstrate that LLMs substantially enhance detection and show promise for clinical decision support and research, although further work on scalability is warranted.
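The headline results above are reported as per-category F1-scores. As a minimal illustration of that metric (not the authors' evaluation code; the labels below are hypothetical), the binary F1 for one substance category can be computed directly from note-level predictions:

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and model predictions for one category
# (1 = substance use documented in the discharge summary)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print(round(f1_score(y_true, y_pred), 3))  # one missed positive lowers recall
```

In practice such a score would be computed separately for each of the eight substance categories, which is why a single model can score ≥ 0.95 on clear-cut substances while scoring lower on harder categories like prescription opioid misuse.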