Pérez Alicia, Weegar Rebecka, Casillas Arantza, Gojenola Koldo, Oronoz Maite, Dalianis Hercules
IXA Group, University of the Basque Country (UPV-EHU), Spain.
Clinical Text Mining Group, Department of Computer and System Sciences (DSV), Stockholm University, Sweden.
J Biomed Inform. 2017 Jul;71:16-30. doi: 10.1016/j.jbi.2017.05.009. Epub 2017 May 16.
This study investigates entity recognition in Electronic Health Records (EHRs) written in Spanish and Swedish. A robust representation of the entities is particularly important, and we used unsupervised methods to generate such representations.
The significance of this work lies in its experimental design: the experiments were carried out under the same conditions for both languages. Several classification approaches were explored: maximum probability, CRF, Perceptron and SVM. The classifiers were enhanced with ensembles of semantic spaces and ensembles of Brown trees. To mitigate data sparsity without significantly increasing the dimension of the decision space, we propose clustered versions of both resources: clusters derived from the trees that represent the hierarchical Brown clustering, and vector quantization of each semantic space.
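As an illustration of the vector quantization idea, the sketch below replaces each word's dense vector with a single cluster identifier per semantic space and then combines several spaces into an ensemble of discrete features. It is a minimal sketch, not the authors' implementation: the use of plain k-means, the toy vectors, and the names `kmeans`, `quantize` and `ensemble_features` are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (an assumed choice of quantizer); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def quantize(space, k, seed=0):
    """Map every word of a semantic space to the id of its nearest centroid."""
    words = sorted(space)
    centroids = kmeans([space[w] for w in words], k, seed=seed)
    return {w: nearest(space[w], centroids) for w in words}


def ensemble_features(word, codebooks):
    """One discrete feature per semantic space instead of one per dimension."""
    return [f"space{i}_cluster={cb[word]}"
            for i, cb in enumerate(codebooks) if word in cb]


# Hypothetical toy semantic spaces; real spaces would be learned from
# unannotated EHR text and have far more words and dimensions.
space_a = {"fever": [0.9, 0.1], "cough": [0.8, 0.2],
           "aspirin": [0.1, 0.9], "ibuprofen": [0.2, 0.8]}
space_b = {"fever": [1.0, 0.0, 0.1], "cough": [0.9, 0.1, 0.0],
           "aspirin": [0.0, 1.0, 0.9], "ibuprofen": [0.1, 0.9, 1.0]}

codebooks = [quantize(space_a, k=2), quantize(space_b, k=2)]
print(ensemble_features("fever", codebooks))
```

In this toy setting each word contributes two categorical features (one per space) rather than five real-valued ones, which is the kind of reduction in feature count that quantization is meant to provide. The resulting strings could be fed to any of the classifiers as additional token features.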
The results showed that the semi-supervised approaches significantly improved on standard supervised techniques for both languages. Moreover, clustering the semantic spaces contributed to the quality of the entity recognition while keeping the dimension of the feature space two orders of magnitude lower than when the semantic spaces were used directly.
The contributions of this study are: (a) a set of thorough experiments that enable comparisons of the influence of different types of features on different classifiers, in two languages other than English; and (b) the use of ensembles of clusters of Brown trees and semantic spaces on EHRs to tackle the scarcity of available annotated data.
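To make contribution (b) more concrete, the sketch below shows one common way to obtain coarse clusters from a Brown tree: each word's position in the tree is a bit string, and truncating that string to a fixed prefix length yields a cluster at that depth. The abstract does not say how the trees were clustered, so the prefix-truncation scheme, the toy bit strings, and the function names here are assumptions for illustration only.

```python
def brown_prefix_features(word, paths, prefix_lengths, tree_name):
    """Features from truncated Brown paths; empty list for unseen words."""
    path = paths.get(word)
    if path is None:
        return []
    # A shorter prefix means a coarser cluster higher up in the tree.
    return [f"{tree_name}_p{n}={path[:n]}" for n in prefix_lengths]


def ensemble_brown_features(word, trees, prefix_lengths=(2, 4)):
    """Concatenate prefix features from every tree in the ensemble."""
    features = []
    for i, paths in enumerate(trees):
        features.extend(
            brown_prefix_features(word, paths, prefix_lengths, f"tree{i}"))
    return features


# Hypothetical bit-string paths, as if produced by two separate
# Brown clustering runs over unannotated text.
tree_a = {"fever": "110100", "cough": "110101", "aspirin": "011010"}
tree_b = {"fever": "0010", "cough": "0011", "aspirin": "1110"}

fever = ensemble_brown_features("fever", [tree_a, tree_b])
cough = ensemble_brown_features("cough", [tree_a, tree_b])
print(fever)
print(sorted(set(fever) & set(cough)))  # features the two words share
```

Words that sit close together in a tree share their short prefixes, so a classifier trained on few annotated examples can generalize from one word to its neighbours in the hierarchy.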