
Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information.

Author Affiliations

Medical Scientist Training Program, Albert Einstein College of Medicine, Bronx, NY, USA.

Penn Medicine Predictive Healthcare, University of Pennsylvania Health System, Philadelphia, PA, USA; Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.

Publication Information

J Biomed Inform. 2022 Jan;125:103971. doi: 10.1016/j.jbi.2021.103971. Epub 2021 Dec 14.

Abstract

OBJECTIVE

Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings.

MATERIALS AND METHODS

We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, the case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested the embeddings on six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively.
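The abstract does not specify the embedding algorithm, so as an illustrative sketch only, the idea of learning word vectors from corpus co-occurrence can be shown with a minimal count-based model (all function names and the toy clinical corpus below are hypothetical, not from the paper):

```python
import math
from collections import defaultdict

def cooccurrence_embeddings(sentences, window=2):
    """Build simple count-based embeddings: each word's vector is its
    co-occurrence count with every vocabulary word within a context window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[sent[j]]] += 1.0
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus standing in for case-report sentences
corpus = [
    ["patient", "presented", "with", "fever"],
    ["patient", "presented", "with", "cough"],
    ["fever", "and", "cough", "resolved"],
]
emb = cooccurrence_embeddings(corpus)
# "fever" and "cough" occur in similar contexts, so their vectors are similar
sim = cosine(emb["fever"], emb["cough"])
```

In practice, embeddings like those in the study would be trained with a neural method (e.g. word2vec-style skip-gram) on millions to billions of tokens; the count-based version above only illustrates the distributional principle.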

RESULTS

Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45), while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings de-identified notes most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%). Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS).
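The scaled Brier score reported above is commonly defined as 1 − BS/BS_max, where BS_max is the Brier score of a non-informative model that predicts the observed event rate for every case (so SBS = 1 is perfect and SBS = 0 is no better than baseline). A small sketch of that computation, with illustrative names and toy data:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and
    binary outcomes (0/1); lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def scaled_brier_score(probs, outcomes):
    """SBS = 1 - BS / BS_max, where BS_max is the Brier score of a
    baseline model that predicts the event rate for every case."""
    rate = sum(outcomes) / len(outcomes)
    bs_max = brier_score([rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_max

# Toy mortality predictions: a perfect model scores SBS = 1.0,
# and predicting the base rate for everyone scores SBS = 0.0
outcomes = [1, 0, 1, 0]
perfect = scaled_brier_score([1.0, 0.0, 1.0, 0.0], outcomes)   # -> 1.0
baseline = scaled_brier_score([0.5, 0.5, 0.5, 0.5], outcomes)  # -> 0.0
```

On this scale, the reported SBS of 0.30 for UPHS embeddings means the model closed 30% of the gap between the non-informative baseline and perfect prediction.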

DISCUSSION

Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on the intended use.

CONCLUSION

Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b78/8766939/7fcb4cc9b31c/nihms-1765734-f0002.jpg

