A Large Language Model-Based Generative Natural Language Processing Framework Finetuned on Clinical Notes Accurately Extracts Headache Frequency from Electronic Health Records.

Authors

Chiang Chia-Chun, Luo Man, Dumkrieger Gina, Trivedi Shubham, Chen Yi-Chieh, Chao Chieh-Ju, Schwedt Todd J, Sarker Abeed, Banerjee Imon

Affiliations

Department of Neurology, Mayo Clinic, Rochester, MN.

Department of Radiology, Mayo Clinic, Phoenix, AZ.

Publication

medRxiv. 2023 Oct 3:2023.10.02.23296403. doi: 10.1101/2023.10.02.23296403.

Abstract

BACKGROUND

Headache frequency, defined as the number of days with any headache in a month (or four weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, due to variations and inconsistencies in clinician documentation, it is challenging for traditional natural language processing (NLP) algorithms to accurately extract headache frequency from the electronic health record (EHR).

METHODS

This was a retrospective cross-sectional study with human subjects identified from three tertiary headache referral centers: Mayo Clinic Arizona, Florida, and Rochester. All neurology consultation notes written by more than 10 headache specialists between 2012 and 2022 were extracted, and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) a ClinicalBERT regression model; (2) a Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) model in a zero-shot setting; (3) a GPT-2 QA model with few-shot training, fine-tuned on Mayo Clinic notes; and (4) a GPT-2 generative model fine-tuned on Mayo Clinic notes to generate the answer by considering the context of the included text.
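
A minimal sketch of the fourth framework, a GPT-2 generative model fine-tuned on prompt-completion pairs built from note text and the documented headache frequency, is shown below. It uses the Hugging Face transformers and datasets libraries; the prompt template, example data, and output directory are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only; the prompt template, example data, and paths are
# assumptions, not the authors' released training code.
from datasets import Dataset
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical training pairs: note text as context, headache days per month as answer.
examples = [
    {"note": "Patient reports headaches on about 10 days per month.", "frequency": "10"},
    {"note": "She describes near-daily headaches, roughly 25 days in the last 4 weeks.", "frequency": "25"},
]

def to_features(example):
    text = (f"Context: {example['note']}\n"
            f"Question: How many headache days per month?\n"
            f"Answer: {example['frequency']}{tokenizer.eos_token}")
    enc = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # causal LM target; padding would normally be masked with -100
    return enc

train_ds = Dataset.from_list(examples).map(to_features, remove_columns=["note", "frequency"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-headache-frequency",
                           num_train_epochs=3, per_device_train_batch_size=2),
    train_dataset=train_ds,
)
trainer.train()
```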

RESULTS

The GPT-2 generative model was the best-performing model, with an accuracy of 0.92 [0.91, 0.93] and an R score of 0.89 [0.87, 0.90], and all GPT-2-based models outperformed the ClinicalBERT model in terms of exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy (0.27 [0.26, 0.28]), it demonstrated a high R score (0.88 [0.85, 0.89]), suggesting that it can reasonably predict the headache frequency within a range of ≤ ±3 days; its R score was also higher than that of the GPT-2 QA zero-shot model or the few-shot fine-tuned GPT-2 QA model.
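
The reported metrics can be computed with a short evaluation routine. The sketch below assumes that accuracy means an exact match between extracted and documented frequencies, that the R score is a correlation between them, and that the ±3-day range is a simple tolerance check; the paper's exact metric definitions may differ.

```python
# Evaluation sketch; metric definitions here are assumptions (exact match,
# Pearson correlation, and a ±3-day tolerance), not necessarily the paper's.
import numpy as np

def exact_match_accuracy(pred, true):
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.mean(pred == true))

def r_score(pred, true):
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return float(np.corrcoef(pred, true)[0, 1])  # Pearson correlation coefficient

def within_days(pred, true, tolerance=3):
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true) <= tolerance))

y_true = [10, 4, 30, 15, 2]   # documented headache days per month
y_pred = [10, 5, 28, 15, 2]   # model-extracted values
print(exact_match_accuracy(y_pred, y_true),
      r_score(y_pred, y_true),
      within_days(y_pred, y_true))
```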

CONCLUSION

We developed a robust model based on a state-of-the-art large language model (LLM), a GPT-2 generative model, that can extract headache frequency from EHR free-text clinical notes with high accuracy and a high R score. It overcame several challenges related to the different ways clinicians document headache frequency, which were not easily handled by traditional NLP models. We also showed that GPT-2-based frameworks outperformed ClinicalBERT in the accuracy of extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code under an open-source license for community use on GitHub.
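
For reference, inference with such a fine-tuned generative model could look like the sketch below, assuming a local checkpoint named "gpt2-headache-frequency" and the same hypothetical prompt template as above; the authors' released GitHub code may expose a different interface.

```python
# Inference sketch; the checkpoint path and prompt template are assumptions,
# not the authors' released interface.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-headache-frequency")
model = GPT2LMHeadModel.from_pretrained("gpt2-headache-frequency")

note = "She continues to have headaches roughly 12 days out of the month."
prompt = f"Context: {note}\nQuestion: How many headache days per month?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5,
                        pad_token_id=tokenizer.eos_token_id)
# Decode only the generated continuation after the prompt.
answer = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True).strip()
print(answer)  # e.g. "12"
```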

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/faea/10593021/0140f8dc7d65/nihpp-2023.10.02.23296403v1-f0001.jpg
