Faiza Khan Khattak, Serena Jeblee, Chloé Pou-Prom, Mohamed Abdalla, Christopher Meaney, Frank Rudzicz
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, Ontario, Canada.
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada.
J Biomed Inform. 2019;100S:100057. doi: 10.1016/j.yjbinx.2019.100057. Epub 2019 Oct 28.
Representing words as numerical vectors based on the contexts in which they appear has become the de facto method of analyzing text with machine learning. In this paper, we provide a guide for training these representations on clinical text data, using a survey of relevant research. Specifically, we discuss different types of word representations, clinical text corpora, available pre-trained clinical word vector embeddings, intrinsic and extrinsic evaluation, applications, and limitations of these approaches. This work can be used as a blueprint for clinicians and healthcare workers who may want to incorporate clinical text features in their own models and applications.