Synthetic data distillation enables the extraction of clinical information at scale.

Authors

Woo Elizabeth Geena, Burkhart Michael C, Alsentzer Emily, Beaulieu-Jones Brett K

Affiliations

Department of Medicine, Biological Sciences Division, University of Chicago, Chicago, IL, USA.

Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA.

Publication information

NPJ Digit Med. 2025 May 10;8(1):267. doi: 10.1038/s41746-025-01681-4.

DOI:10.1038/s41746-025-01681-4

PMID:40348936

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12065832/

Abstract

Large language models (LLMs) show promise for clinical note information extraction, but deployment challenges include high computational costs and privacy concerns. We used synthetic data distillation to fine-tune smaller, open-source LLMs to achieve performance comparable to larger models while enabling local hardware deployment or reduced cloud costs. Using Llama-3.1-70B-Instruct, we generated synthetic question-answer training pairs to fine-tune smaller Llama models. We evaluated performance across three tasks: synthetic clinical trial criteria, the i2b2 2018 Clinical Trial Eligibility Challenge, and apixaban trial criteria questions. The 8B-parameter model achieved high accuracy across all tasks and sometimes outperformed the 70B-Instruct teacher model. Fine-tuning with only the most challenging questions still improved performance, demonstrating the value of targeted training. Results from 3B- and 1B-parameter models showed a clear size-performance tradeoff. This work demonstrates synthetic data distillation's potential for enabling scalable clinical information extraction.
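As a rough illustration of the distillation step described in the abstract, the sketch below shows how teacher-generated question-answer pairs might be collected into prompt/completion records for fine-tuning a smaller model. It is not the authors' pipeline: the function names, the record layout, the prompt wording, and the example notes are all hypothetical, and `stub_teacher` is a placeholder standing in for a real call to a teacher model such as Llama-3.1-70B-Instruct.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a teacher maps a clinical note to (question, answer) pairs.
Teacher = Callable[[str], List[Tuple[str, str]]]


def build_distillation_records(notes: List[str], teacher: Teacher) -> List[Dict[str, str]]:
    """Turn teacher-generated question-answer pairs into fine-tuning records."""
    records = []
    for note in notes:
        for question, answer in teacher(note):
            # Assumed prompt layout; a real pipeline would use the model's chat template.
            prompt = f"Clinical note:\n{note}\n\nQuestion: {question}"
            records.append({"prompt": prompt, "completion": answer})
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def stub_teacher(note: str) -> List[Tuple[str, str]]:
    """Placeholder teacher (assumption): a keyword rule instead of a real LLM call."""
    answer = "yes" if "apixaban" in note.lower() else "no"
    return [("Is the patient taking apixaban?", answer)]


if __name__ == "__main__":
    # Invented example notes, not real patient data.
    example_notes = [
        "Patient started on apixaban 5 mg twice daily.",
        "No anticoagulation prescribed; aspirin only.",
    ]
    print(to_jsonl(build_distillation_records(example_notes, stub_teacher)))
```

In practice the stub would be replaced by a call to the teacher model, and the resulting JSONL file would feed whatever fine-tuning framework is used for the smaller student model.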