Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada.
Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada.
J Am Med Inform Assoc. 2019 May 1;26(5):438-446. doi: 10.1093/jamia/ocy189.
In biomedicine, a wealth of information is hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm is needed to prevent downstream difficulties in the natural language processing pipeline. Supervised WSD algorithms largely outperform unsupervised, semisupervised, and knowledge-based methods; however, they train one separate classifier for each ambiguous term, which requires large amounts of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with vocabulary size is desirable.
Building on recent advances in deep learning, our deepBioWSD model uses a single bidirectional long short-term memory network that predicts the sense of any ambiguous term. First, Unified Medical Language System sense embeddings are computed from the senses' text definitions; then, after the network is initialized with these embeddings, it is trained on all available training data collectively. The method also introduces a novel technique for automatically collecting training data from PubMed to (pre)train the network in an unsupervised manner.
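The idea of deriving sense embeddings from text definitions and scoring a context against them can be illustrated with a minimal sketch. It is not the deepBioWSD implementation: averaging word vectors stands in for both the definition encoder and the bidirectional LSTM context encoder, and the toy vectors, the sense labels, and the helper names `embed_text` and `predict_sense` are all hypothetical.

```python
import math

def embed_text(text, word_vectors):
    """Average the vectors of known words (an assumed, simplified encoder)."""
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        raise ValueError("no known words in text")
    dim = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def predict_sense(context_vec, sense_embeddings):
    """Return the sense whose embedding is closest to the context vector."""
    return max(sense_embeddings, key=lambda s: cosine(context_vec, sense_embeddings[s]))

# Toy 2-dimensional word vectors (hypothetical values).
word_vectors = {
    "low": [0.8, 0.2], "temperature": [1.0, 0.0],
    "virus": [0.0, 1.0], "infection": [0.1, 0.9],
}
# Hypothetical sense labels with short text definitions.
definitions = {
    "cold_temperature": "low temperature",
    "common_cold": "virus infection",
}
sense_embeddings = {s: embed_text(d, word_vectors) for s, d in definitions.items()}

# One shared scoring function serves every ambiguous term.
context_vec = embed_text("patient has virus infection", word_vectors)
predicted = predict_sense(context_vec, sense_embeddings)
```

Because every sense lives in the same embedding space, a single scoring function covers the whole vocabulary, which is the property that removes the need for one classifier per term.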
We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracy as evaluation metrics. deepBioWSD outperforms existing models for biomedical text WSD, achieving a state-of-the-art macro accuracy of 96.82%.
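The difference between the two metrics can be shown with a short sketch, assuming the usual definitions: micro accuracy pools all test instances, whereas macro accuracy averages the per-term accuracies so that each ambiguous term counts equally. The function name `micro_macro_accuracy` and the toy records are hypothetical and are not taken from the MSH WSD dataset.

```python
from collections import defaultdict

def micro_macro_accuracy(records):
    """records: iterable of (ambiguous_term, gold_sense, predicted_sense)."""
    per_term = defaultdict(lambda: [0, 0])  # term -> [correct, total]
    for term, gold, pred in records:
        per_term[term][0] += int(gold == pred)
        per_term[term][1] += 1
    correct = sum(c for c, _ in per_term.values())
    total = sum(n for _, n in per_term.values())
    micro = correct / total                                            # pooled over instances
    macro = sum(c / n for c, n in per_term.values()) / len(per_term)   # mean over terms
    return micro, macro

# Toy predictions: "cold" has 3 of 4 correct, "BAT" has 2 of 2 correct.
toy = [
    ("cold", "S1", "S1"), ("cold", "S1", "S1"),
    ("cold", "S2", "S2"), ("cold", "S2", "S1"),
    ("BAT", "S1", "S1"), ("BAT", "S2", "S2"),
]
micro, macro = micro_macro_accuracy(toy)
```

On this toy input, micro accuracy is 5/6 and macro accuracy is (0.75 + 1.0) / 2 = 0.875; the two diverge whenever terms have different numbers of test instances.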
Apart from improved disambiguation and unsupervised training, deepBioWSD depends on considerably fewer expert-labeled data because it learns the target and context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.