LCD benchmark: long clinical document benchmark on mortality prediction for language models.

Authors

Yoon WonJin, Chen Shan, Gao Yanjun, Zhao Zhanzhan, Dligach Dmitriy, Bitterman Danielle S, Afshar Majid, Miller Timothy

Affiliations

Computational Health Informatics Program, Boston Children's Hospital, Boston, MA 02215, United States.

Department of Pediatrics, Harvard Medical School, Boston, MA 02115, United States.

Publication information

J Am Med Inform Assoc. 2025 Feb 1;32(2):285-295. doi: 10.1093/jamia/ocae287.

DOI:10.1093/jamia/ocae287

PMID:39602813

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11756648/

Abstract

OBJECTIVES

Applying natural language processing (NLP) in the clinical domain is important because clinical documents contain rich unstructured information that is often unavailable in structured data. When NLP methods are applied to a given domain, benchmark datasets are crucial: they not only guide the selection of the best-performing models but also enable assessment of the reliability of the generated outputs. Although language models capable of handling longer contexts have recently become available, no benchmark dataset targets long clinical document classification tasks.

MATERIALS AND METHODS

To address this issue, we propose the Long Clinical Document (LCD) benchmark, a benchmark for the task of predicting 30-day out-of-hospital mortality using discharge notes from Medical Information Mart for Intensive Care IV (MIMIC-IV) and statewide death data. We evaluated this benchmark dataset using baseline models ranging from bag-of-words and convolutional neural network models to instruction-tuned large language models. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations.
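To make the prediction target concrete, the following is a minimal sketch of how a binary 30-day out-of-hospital mortality label could be derived from a discharge date and a death date. The function name, its parameters, and the boundary handling (a death on or before the discharge date is not counted as out-of-hospital) are illustrative assumptions; the abstract does not specify the exact labeling rule used to build the benchmark.

```python
from datetime import date
from typing import Optional


def died_within_window(discharge: date, death: Optional[date],
                       window_days: int = 30) -> int:
    """Return 1 if death falls after discharge and within the window, else 0.

    Hypothetical helper: the boundary rule here is an assumption,
    not the benchmark's documented definition.
    """
    if death is None:
        # No death record available: treated as a negative label.
        return 0
    days_after = (death - discharge).days
    # Assumption: death on or before the discharge date is not out-of-hospital.
    return int(0 < days_after <= window_days)


if __name__ == "__main__":
    print(died_within_window(date(2020, 1, 1), date(2020, 1, 15)))
    print(died_within_window(date(2020, 1, 1), None))
```

In this sketch the window is a parameter, so other horizons could be explored by changing `window_days`.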

RESULTS

Among the baseline models, the best-performing supervised model achieved an F1 score of 28.9%, and GPT-4 achieved 32.2%. Notes in our dataset have a median word count of 1687.
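For readers less familiar with the metric, the sketch below computes a binary F1 score from gold labels and predictions. It assumes F1 is taken on the positive (mortality) class, which the abstract does not state explicitly; the function name and the toy labels are illustrative and are not data from the benchmark.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example only; these labels are made up for illustration.
    gold = [1, 0, 0, 1, 0, 0, 0, 1]
    pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(round(binary_f1(gold, pred), 4))
```

Because F1 ignores true negatives, it can remain low on a task with rare positive labels even when overall accuracy is high.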

DISCUSSION

Our analysis of the model outputs showed that our dataset is challenging for both models and human experts, but the models can find meaningful signals from the text.

CONCLUSION

We expect our LCD benchmark to be a resource for the development of advanced supervised models, or prompting methods, tailored for clinical text.

