Medical Scientist Training Program, Albert Einstein College of Medicine, Bronx, NY, USA.
Penn Medicine Predictive Healthcare, University of Pennsylvania Health System, Philadelphia, PA, USA; Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.
J Biomed Inform. 2022 Jan;125:103971. doi: 10.1016/j.jbi.2021.103971. Epub 2021 Dec 14.
To quantify tradeoffs in performance, reproducibility, and resource demands among several strategies for developing clinically relevant word embeddings.
We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, the case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested the embeddings in six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively.
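The scaled Brier score rescales a model's Brier score against a reference model; a common convention, and the one assumed in the minimal sketch below, is a reference that predicts the observed outcome prevalence for every case, so that 1.0 indicates a perfect model and 0.0 a model no better than the reference. The exact reference model and the example labels and probabilities are illustrative assumptions, not taken from the study.

import numpy as np

def brier_score(y_true, y_prob):
    # Mean squared error between predicted probabilities and binary outcomes
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return np.mean((y_prob - y_true) ** 2)

def scaled_brier_score(y_true, y_prob):
    # SBS = 1 - Brier(model) / Brier(reference),
    # where the assumed reference predicts the outcome prevalence for every case
    y_true = np.asarray(y_true, dtype=float)
    prevalence = np.mean(y_true)
    ref = brier_score(y_true, np.full(len(y_true), prevalence))
    return 1.0 - brier_score(y_true, y_prob) / ref

# Hypothetical mortality labels and predicted probabilities
y = [0, 1, 0, 0, 1, 0, 1, 0]
p = [0.2, 0.7, 0.1, 0.3, 0.6, 0.2, 0.8, 0.1]
print(round(scaled_brier_score(y, p), 2))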
Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45), while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings de-identified notes most consistently (78% of notes) and embeddings from the full PMC corpus least consistently (48%). Across all six tasks, the full PMC corpus demonstrated the most consistent performance and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion tokens (UPHS).
Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on the intended use.
Embeddings trained on published case reports performed comparably to embeddings trained on larger corpora on most clinical tasks. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.