Ren Libo, Belkadi Samuel, Han Lifeng, Del-Pinto Warren, Nenadic Goran
Department of Computer Science, University of Manchester, Manchester, United Kingdom.
Department of Engineering, University of Cambridge, Cambridge, United Kingdom.
Front Digit Health. 2025 May 30;7:1497130. doi: 10.3389/fdgth.2025.1497130. eCollection 2025.
Clinical letters contain sensitive information, limiting their use in model training, medical research, and education. This study aims to generate reliable, diverse, and de-identified synthetic clinical letters to support these tasks. We investigated multiple pre-trained language models for text masking and generation, focusing on Bio_ClinicalBERT, and applied different masking strategies. Evaluation included qualitative and quantitative assessments, downstream named entity recognition (NER) tasks, and clinically focused evaluations using BioGPT and GPT-3.5-turbo. The experiments show: (1) encoder-only models perform better than encoder-decoder models; (2) models trained on general corpora perform comparably to clinical-domain models when clinical entities are preserved; (3) preserving clinical entities and document structure aligns the generated letters with the task objectives; (4) masking strategies have a noticeable impact on the quality of synthetic clinical letters: masking stopwords has a positive effect, while masking nouns or verbs has a negative effect; (5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; (6) contextual information has only a limited effect on the models' understanding, suggesting that synthetic letters can effectively substitute for real ones in downstream NER tasks; (7) although the models occasionally generate hallucinated content, this appears to have little effect on overall clinical performance. Unlike previous research, which primarily focuses on reconstructing original letters by training language models, this paper provides a foundational framework for generating diverse, de-identified clinical letters. It offers a direction for applying the model to real-world clinical letters, thereby helping to expand datasets in the clinical domain. Our code and trained models are available at https://github.com/HECTA-UoM/Synthetic4Health.
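The masking-then-generation approach described above can be illustrated with a minimal sketch of the stopword-masking step. In the study's pipeline, masked slots are then filled by a model such as Bio_ClinicalBERT; here only the masking side is shown, with a small hand-picked stopword list (an assumption, since the paper's exact stopword list and tokenizer are not given in the abstract).

```python
# Illustrative stopword masking: replace function words with [MASK]
# while leaving clinical entities and content words untouched.
# STOPWORDS is a hypothetical toy list, not the paper's actual one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "was", "is", "in", "on"}

def mask_stopwords(text: str, mask_token: str = "[MASK]") -> str:
    """Replace stopword tokens with mask_token, preserving other tokens."""
    out = []
    for tok in text.split():
        # Compare case-insensitively, ignoring trailing punctuation.
        core = tok.strip(".,;:").lower()
        out.append(mask_token if core in STOPWORDS else tok)
    return " ".join(out)

letter = "The patient was admitted to the ward with sepsis."
print(mask_stopwords(letter))
# [MASK] patient [MASK] admitted [MASK] [MASK] ward with sepsis.
```

A fill-mask model would then rewrite each `[MASK]` slot, yielding a letter that keeps the clinical content but varies the surrounding wording; this matches finding (4), that masking stopwords helps while masking nouns or verbs hurts, since content words carry the clinical meaning.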
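Finding (5) names BERTScore as the primary quantitative metric. Its core idea can be sketched with a toy greedy-matching computation: each token embedding in the candidate is matched to its most similar reference token by cosine similarity, and vice versa, giving precision, recall, and F1. Real BERTScore uses contextual BERT embeddings; the 2-d vectors below are invented purely for demonstration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def bertscore(cand, ref):
    """Greedy-matching BERTScore over lists of token embeddings."""
    # Precision: each candidate token matched to its best reference token.
    p = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token matched to its best candidate token.
    r = sum(max(cosine(c, rf) for c in cand) for rf in ref) / len(ref)
    return p, r, 2 * p * r / (p + r)

# Toy embeddings for a 2-token candidate and reference.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.6, 0.8]]
p, r, f1 = bertscore(cand, ref)
print(p, r, f1)  # 0.9 0.9 0.9
```

Because matching is done in embedding space rather than over exact n-grams, BERTScore credits paraphrases that preserve meaning, which is why it suits synthetic letters that deliberately reword the original text.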