Peng Mingkai, Quan Hude
Department of Community Health Sciences, University of Calgary, Calgary, Canada.
Stud Health Technol Inform. 2020 Jun 16;270:88-92. doi: 10.3233/SHTI200128.
The objective of this study was to develop a method for clinical abbreviation disambiguation using deep contextualized representations and cluster analysis. We used the pre-trained BioELMo language model to generate a contextualized word vector for the abbreviation in each instance. Principal component analysis was then applied to the word vectors to reduce their dimensionality. K-Means cluster analysis was run separately for each abbreviation, and each cluster was assigned a sense by majority vote over the annotations of its instances. Our method achieved an average accuracy of around 95% across 74 abbreviations. Simulation showed that 5 annotated samples per cluster were needed to determine the cluster's sense.
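The sense-assignment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the cluster indices have already been produced by K-Means on the dimension-reduced BioELMo vectors (those steps are not shown), and the function names, the toy abbreviation senses, and the random sampling of 5 instances per cluster are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict


def assign_cluster_senses(cluster_ids, annotations, n_per_cluster=5, seed=0):
    """Label each cluster with the majority sense among a few annotated samples.

    cluster_ids: one cluster index per instance (assumed to come from K-Means).
    annotations: one sense label per instance; only the sampled ones are read,
                 mimicking a small annotation budget per cluster.
    """
    rng = random.Random(seed)

    # Group instance positions by the cluster they were assigned to.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    senses = {}
    for cid, idxs in members.items():
        # Draw up to n_per_cluster instances to "annotate".
        sampled = rng.sample(idxs, min(n_per_cluster, len(idxs)))
        votes = Counter(annotations[i] for i in sampled)
        # The most frequent sense among the sampled annotations wins.
        senses[cid] = votes.most_common(1)[0][0]
    return senses


def accuracy(cluster_ids, annotations, senses):
    """Fraction of instances whose cluster sense matches their annotation."""
    correct = sum(senses[c] == a for c, a in zip(cluster_ids, annotations))
    return correct / len(annotations)


# Hypothetical toy data for one abbreviation with two senses.
cluster_ids = [0] * 6 + [1] * 6
annotations = (
    ["magnetic resonance"] * 6
    + ["mitral regurgitation"] * 5
    + ["magnetic resonance"]  # one instance that landed in the wrong cluster
)

senses = assign_cluster_senses(cluster_ids, annotations)
acc = accuracy(cluster_ids, annotations, senses)
print(senses, acc)
```

Every instance in a cluster inherits the cluster's sense, so the misclustered instance in the toy data is counted as an error even though the vote itself is correct.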