University of Technology Sydney (UTS), Australia; Capital Markets Cooperative Research Centre (CMCRC), Australia.
Capital Markets Cooperative Research Centre (CMCRC), Australia.
J Biomed Inform. 2017 Dec;76:102-109. doi: 10.1016/j.jbi.2017.11.007. Epub 2017 Nov 13.
Previous state-of-the-art systems for Drug Name Recognition (DNR) and Clinical Concept Extraction (CCE) have combined text "feature engineering" with conventional machine learning algorithms such as conditional random fields and support vector machines. However, developing good features is inherently time-consuming. In contrast, more modern machine learning approaches such as recurrent neural networks (RNNs) have proved capable of automatically learning effective features from either random assignments or automated word "embeddings".
(i) To create a highly accurate DNR and CCE system that avoids conventional, time-consuming feature engineering. (ii) To create richer, more specialized word embeddings by using health domain datasets such as MIMIC-III. (iii) To evaluate our systems over three contemporary datasets.
Two deep learning methods, namely the Bidirectional LSTM and the Bidirectional LSTM-CRF, are evaluated. A CRF model serves as the baseline, so that the deep learning systems can be compared with a traditional machine learning approach. The same features are used for all the models.
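To illustrate what the CRF layer adds on top of a Bidirectional LSTM, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF, not the authors' implementation. It assumes the network has already produced a per-token score for every tag (the "emissions") and that a table of tag-to-tag transition scores has been learned. The function name `viterbi_decode`, the BIO tag names and all scores are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag] is the (assumed) network score for `tag` at token t.
    transitions[(prev, cur)] is the score for moving from `prev` to `cur`;
    missing pairs are treated as 0.0.
    """
    if not emissions:
        return []

    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    # back[t - 1][tag]: the previous tag on that best path.
    back = []

    for t in range(1, len(emissions)):
        scores = {}
        pointers = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for the three tokens "take acetylsalicylic acid".
TAGS = ["O", "B-DRUG", "I-DRUG"]
EMISSIONS = [
    {"O": 2.0, "B-DRUG": 0.0, "I-DRUG": 0.0},
    {"O": 0.5, "B-DRUG": 1.0, "I-DRUG": 0.8},
    {"O": 1.0, "B-DRUG": 0.2, "I-DRUG": 0.9},
]
TRANSITIONS = {("B-DRUG", "I-DRUG"): 1.5, ("O", "I-DRUG"): -10.0}

print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))
```

With these made-up scores, picking the best tag for each token independently would label the last token `O`, whereas decoding with the transition scores keeps the two-token drug mention together. This kind of sequence-level consistency is the usual motivation for placing a CRF layer on top of a Bidirectional LSTM.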
We have obtained the best results with the Bidirectional LSTM-CRF model, which has outperformed all previously proposed systems. The specialized embeddings have helped to cover unusual words in DrugBank and MedLine, but not in the i2b2/VA dataset.
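One simple way to quantify how well a set of embeddings "covers" a dataset is to measure the fraction of distinct tokens that have a vector. The sketch below is an assumption about how such a check could be done, not the procedure used in the study. The helper name `embedding_coverage`, the lowercasing step and the toy vocabulary are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def embedding_coverage(
    tokens: Iterable[str], vocab: Set[str]
) -> Tuple[float, List[str]]:
    """Return (fraction of distinct tokens found in vocab, sorted missing tokens).

    Tokens are lowercased before lookup; this normalization is an assumption
    and should match however the embeddings were actually trained.
    """
    distinct = sorted({tok.lower() for tok in tokens})
    missing = [tok for tok in distinct if tok not in vocab]
    if not distinct:
        return 1.0, []
    return 1.0 - len(missing) / len(distinct), missing


# Toy example: a general-domain vocabulary that lacks a drug name.
general_vocab = {"patient", "given", "and"}
sentence = ["Patient", "given", "warfarin", "and", "Warfarin"]

coverage, missing = embedding_coverage(sentence, general_vocab)
print(coverage, missing)
```

Running the same check with embeddings trained on a health-domain corpus would show whether the list of missing tokens shrinks for a given dataset.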
We present a state-of-the-art system for DNR and CCE. Automated word embeddings have allowed us to avoid costly feature engineering and achieve higher accuracy. Nevertheless, the embeddings need to be retrained over datasets suited to the domain in order to adequately cover the domain-specific vocabulary.