
Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information.

Author Affiliations

Medical Scientist Training Program, Albert Einstein College of Medicine, Bronx, NY, USA.

Penn Medicine Predictive Healthcare, University of Pennsylvania Health System, Philadelphia, PA, USA; Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.

Publication Information

J Biomed Inform. 2022 Jan;125:103971. doi: 10.1016/j.jbi.2021.103971. Epub 2021 Dec 14.

Abstract

OBJECTIVE

Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings.

MATERIALS AND METHODS

We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, the case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested the embeddings on six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively.
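The abstract does not specify the embedding algorithm, so as an illustrative sketch only, the idea of learning word vectors from corpus co-occurrence can be shown with a minimal count-based model (all function names and the toy clinical corpus below are hypothetical, not from the paper):

```python
import math
from collections import defaultdict

def cooccurrence_embeddings(sentences, window=2):
    """Build simple count-based embeddings: each word's vector is its
    co-occurrence count with every vocabulary word within a context window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[sent[j]]] += 1.0
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus standing in for case-report sentences
corpus = [
    ["patient", "presented", "with", "fever"],
    ["patient", "presented", "with", "cough"],
    ["fever", "and", "cough", "resolved"],
]
emb = cooccurrence_embeddings(corpus)
# "fever" and "cough" occur in similar contexts, so their vectors are similar
sim = cosine(emb["fever"], emb["cough"])
```

In practice, embeddings like those in the study would be trained with a neural method (e.g. word2vec-style skip-gram) on millions to billions of tokens; the count-based version above only illustrates the distributional principle.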

RESULTS

Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45), while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings de-identified notes most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%). Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS).
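The scaled Brier score reported above is commonly defined as 1 − BS/BS_max, where BS_max is the Brier score of a non-informative model that predicts the observed event rate for every case (so SBS = 1 is perfect and SBS = 0 is no better than baseline). A small sketch of that computation, with illustrative names and toy data:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and
    binary outcomes (0/1); lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def scaled_brier_score(probs, outcomes):
    """SBS = 1 - BS / BS_max, where BS_max is the Brier score of a
    baseline model that predicts the event rate for every case."""
    rate = sum(outcomes) / len(outcomes)
    bs_max = brier_score([rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_max

# Toy mortality predictions: a perfect model scores SBS = 1.0,
# and predicting the base rate for everyone scores SBS = 0.0
outcomes = [1, 0, 1, 0]
perfect = scaled_brier_score([1.0, 0.0, 1.0, 0.0], outcomes)   # -> 1.0
baseline = scaled_brier_score([0.5, 0.5, 0.5, 0.5], outcomes)  # -> 0.0
```

On this scale, the reported SBS of 0.30 for UPHS embeddings means the model closed 30% of the gap between the non-informative baseline and perfect prediction.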

DISCUSSION

Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on the intended use.

CONCLUSION

Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.


Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b78/8766939/7fcb4cc9b31c/nihms-1765734-f0002.jpg

