College of Computer Science and Technology, Huaqiao University, Xiamen, 361021, China.
Research Department, Zhiye software, Xiamen, 361021, China.
BMC Bioinformatics. 2019 Feb 1;20(1):62. doi: 10.1186/s12859-019-2617-8.
Driven by big data, powerful computation and new algorithmic techniques, deep learning has undergone a renaissance, particularly in its combination with natural language processing (NLP). The advent of electronic medical records (EMRs) has not only changed the format of medical records but also helped users obtain information faster. However, working directly with Chinese EMRs poses many challenges: the records are low in quality, huge in quantity, imbalanced, and semi-structured or unstructured, and Chinese has a higher information density than English. Effective word segmentation, word representation and model architecture are therefore the core technologies for research on Chinese EMRs.
In this paper, we propose a deep learning framework for intelligent diagnosis from Chinese EMR data, which applies a convolutional neural network (CNN) to EMR classification. The contributions of this paper are as follows: (1) We construct a pediatric medical dictionary based on Chinese EMRs. (2) We use Word2vec word embeddings to capture the semantics of Chinese EMR content. (3) We construct a fine-tuned CNN model that performs pediatric diagnosis from Chinese EMR data. Our results on real-world pediatric Chinese EMRs show that the average accuracy and F1-score of the CNN models reach up to 81%, which indicates the effectiveness of CNNs for EMR classification. In particular, a fine-tuned one-layer CNN performs best among all CNN, recurrent neural network (RNN) (long short-term memory, gated recurrent unit) and CNN-RNN models, with average accuracy and F1-score both reaching up to 83%.
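To make the role of a domain dictionary in word segmentation concrete, the following is a minimal sketch of greedy forward maximum matching. It is an illustration only: the abstract does not state which segmentation algorithm the authors used, and the function name `segment` and the entries in `pediatric_terms` are hypothetical.

```python
def segment(text, dictionary, max_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    NOTE: illustrative only, not necessarily the method used in the paper.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary entries (patient, fever, cough, bronchus,
# bronchopneumonia); a real dictionary would be built from the EMR corpus.
pediatric_terms = {"患儿", "发热", "咳嗽", "支气管", "支气管肺炎"}

tokens = segment("患儿发热咳嗽支气管肺炎", pediatric_terms)
print(tokens)  # ['患儿', '发热', '咳嗽', '支气管肺炎']
```

Because the longest match wins, the multi-character term 支气管肺炎 (bronchopneumonia) is kept as one token instead of being split into 支气管 (bronchus) plus leftover characters, which is the kind of behavior a domain-specific dictionary is meant to secure.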
The CNN framework, which comprises word segmentation, word embedding and model training, can serve as an intelligent auxiliary diagnosis tool for pediatricians. In particular, the strong performance of the fine-tuned one-layer CNN suggests that word order does not contribute much useful information in our Chinese EMRs.
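One way to see why a one-layer CNN is largely insensitive to word order is to look at a single convolution layer followed by max-over-time pooling: each filter responds to a short local window of words, and pooling keeps only the strongest response regardless of where it occurred. The sketch below assumes that common text-CNN design; the abstract does not specify the architecture, and the toy 2-dimensional embeddings, the filter weights and the name `conv_max_pool` are hypothetical.

```python
def conv_max_pool(embedded, filters, biases):
    """One convolution layer + ReLU + max-over-time pooling.

    embedded: list of word vectors, one per token.
    filters:  list of filters; each filter is a list of `width` weight
              rows, one row per word in the window.
    Returns one pooled feature per filter.
    """
    pooled = []
    for filt, bias in zip(filters, biases):
        width = len(filt)
        best = 0.0  # ReLU floor: the pooled activation is never below zero
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            score = bias + sum(
                w * x
                for weight_row, word_vec in zip(filt, window)
                for w, x in zip(weight_row, word_vec)
            )
            best = max(best, score)
        pooled.append(best)
    return pooled


# Toy embeddings (hypothetical, 2-dimensional).
fever, cough, pad = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

# A single width-2 filter that fires on the bigram "fever cough".
filters = [[[1.0, 0.0], [0.0, 1.0]]]
biases = [0.0]

doc_a = [fever, cough, pad, pad]  # bigram at the start of the record
doc_b = [pad, pad, fever, cough]  # same bigram at the end

print(conv_max_pool(doc_a, filters, biases))  # [2.0]
print(conv_max_pool(doc_b, filters, biases))  # [2.0]
```

Both records yield the same pooled feature even though the phrase sits at different positions, so only local word order inside the filter window affects the output. This is consistent with, though not proof of, the observation that global word order adds little for these EMRs.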