Chair of Medical Informatics, University Erlangen-Nuremberg, Germany.
Stud Health Technol Inform. 2022 May 16;292:23-27. doi: 10.3233/SHTI220314.
Among medical applications of natural language processing (NLP), word sense disambiguation (WSD) selects among the alternative meanings of a homonym based on its surrounding text. Recently developed NLP methods include word vectors, which combine easy computability with nuanced semantic representations. Here we explore the utility of simple linear WSD classifiers that aggregate word vectors from a modern biomedical NLP library over homonym contexts. We evaluated eight WSD tasks that use literature abstracts as textual contexts. Discriminative performance was measured on held-out annotations as the median area under the sensitivity-specificity curve (AUC) across tasks and 200 bootstrap repetitions. We found that classifiers trained on domain-specific vectors outperformed those based on a general language model by 4.0 percentage points, and that a preprocessing step of filtering out stopwords and punctuation marks improved discrimination by a further 0.7 points. The best models achieved a median AUC of 0.992 (interquartile range 0.975-0.998). These improvements suggest that more advanced WSD methods might also benefit from domain-specific vectors derived from large biomedical corpora.
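The pipeline the abstract describes can be sketched as follows. This is a minimal, hedged illustration, not the authors' implementation: the toy word vectors stand in for embeddings from a biomedical NLP library (e.g. scispaCy), and the helper name `embed_context` is an assumption for illustration. The steps mirror the abstract: filter stopwords and punctuation, average the remaining word vectors over the homonym's context, train a simple linear classifier per WSD task, and score it with AUC.

```python
# Sketch: linear WSD classifier over averaged word vectors, with
# stopword/punctuation filtering. Toy random vectors stand in for a
# biomedical model's embeddings; `embed_context` is an illustrative name.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
DIM = 50
STOPWORDS = {"the", "a", "of", "in", "is", "and"}
PUNCT = ".,;:!?()"

# Toy vocabulary: two word clusters standing in for domain-specific vectors.
VOCAB = {w: rng.normal(loc=+1.0, scale=1.0, size=DIM)
         for w in ["cell", "tissue", "culture", "biology"]}
VOCAB.update({w: rng.normal(loc=-1.0, scale=1.0, size=DIM)
              for w in ["prison", "block", "inmate", "jail"]})

def embed_context(text: str) -> np.ndarray:
    """Filter stopwords/punctuation, then average the remaining word vectors."""
    tokens = [t.strip(PUNCT).lower() for t in text.split()]
    vecs = [VOCAB[t] for t in tokens
            if t and t not in STOPWORDS and t in VOCAB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Tiny WSD task for the homonym "cell": sense 0 = biology, sense 1 = prison.
contexts = [
    "the cell culture in tissue biology",
    "biology of the cell and tissue culture",
    "cell tissue culture, biology",
    "the inmate in a prison cell block",
    "prison block and the jail cell",
    "jail inmate, cell block prison",
]
labels = [0, 0, 0, 1, 1, 1]

X = np.stack([embed_context(c) for c in contexts])
clf = LogisticRegression().fit(X, labels)
auc = roc_auc_score(labels, clf.predict_proba(X)[:, 1])
print(round(auc, 3))
```

In the study, evaluation is done on held-out annotations with 200 bootstrap repetitions rather than on the training data as in this toy example; the bootstrap step would simply resample the evaluation set and take the median AUC across repetitions and tasks.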