
LCD benchmark: long clinical document benchmark on mortality prediction for language models.

Authors

Yoon WonJin, Chen Shan, Gao Yanjun, Zhao Zhanzhan, Dligach Dmitriy, Bitterman Danielle S, Afshar Majid, Miller Timothy

Affiliations

Computational Health Informatics Program, Boston Children's Hospital, Boston, MA 02215, United States.

Department of Pediatrics, Harvard Medical School, Boston, MA 02115, United States.

Publication

J Am Med Inform Assoc. 2025 Feb 1;32(2):285-295. doi: 10.1093/jamia/ocae287.

Abstract

OBJECTIVES

The application of natural language processing (NLP) in the clinical domain is important due to the rich unstructured information in clinical documents, which often remains inaccessible in structured data. When applying NLP methods to a given domain, benchmark datasets play a crucial role: they not only guide the selection of best-performing models but also enable assessment of the reliability of the generated outputs. Despite the recent availability of language models capable of processing longer contexts, benchmark datasets targeting long clinical document classification tasks are lacking.

MATERIALS AND METHODS

To address this issue, we propose the Long Clinical Document (LCD) benchmark, a benchmark for the task of predicting 30-day out-of-hospital mortality using discharge notes from the Medical Information Mart for Intensive Care IV and statewide death data. We evaluated this benchmark dataset using baseline models ranging from bag-of-words and convolutional neural networks to instruction-tuned large language models. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations.
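To make the task setup concrete, the simplest baseline family mentioned above can be sketched as a bag-of-words classifier over note text. This is an illustrative sketch only, not the authors' code: the toy notes, labels, and model choice (logistic regression via scikit-learn) are assumptions for demonstration, and the real benchmark uses MIMIC-IV discharge notes.

```python
# Illustrative sketch (not the authors' implementation): a bag-of-words
# baseline for binary 30-day out-of-hospital mortality prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for discharge notes and mortality labels (1 = died
# within 30 days of discharge); real notes are far longer.
train_notes = [
    "patient discharged in stable condition, follow up in two weeks",
    "metastatic disease, transitioned to comfort-focused hospice care",
    "routine recovery after elective procedure, no complications",
    "severe heart failure, declined further intervention, home hospice",
]
train_labels = [0, 1, 0, 1]

# Vectorize notes into sparse token-count features, then fit a
# linear classifier on those counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_notes)
clf = LogisticRegression().fit(X_train, train_labels)

# Predict mortality for an unseen note.
test_notes = ["stable condition at discharge, routine follow up scheduled"]
preds = clf.predict(vectorizer.transform(test_notes))
print(preds)
```

Stronger baselines in the paper replace the count features with learned representations (CNNs, instruction-tuned LLMs), but the input/output contract — note text in, binary mortality label out — stays the same.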

RESULTS

The best-performing supervised baseline achieved an F1 score of 28.9%, while GPT-4 achieved 32.2%. Notes in our dataset have a median length of 1687 words.

DISCUSSION

Our analysis of the model outputs showed that our dataset is challenging for both models and human experts, but the models can find meaningful signals from the text.

CONCLUSION

We expect our LCD benchmark to be a resource for the development of advanced supervised models or prompting methods tailored for clinical text.


