Ahltorp Magnus, Skeppstedt Maria, Kitajima Shiho, Henriksson Aron, Rzepka Rafal, Araki Kenji
, Stockholm, Sweden.
Department of Computer Science, Linnaeus University/Gavagai, Växjö/Stockholm, Sweden.
J Biomed Semantics. 2016 Sep 26;7(1):58. doi: 10.1186/s13326-016-0093-x.
Research on medical vocabulary expansion from large corpora has primarily been conducted on text written in English or similar languages, because large biomedical corpora are scarce in most languages. Medical vocabularies are, however, also essential for text mining from corpora that are written in languages other than English and that belong to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthography as well as text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.
Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, 100 from each of three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. Unknown terms close to the centroids of the resulting clusters were then extracted automatically and suggested as candidates for new terms to include in the vocabulary. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.
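The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensionality, the number of non-zero entries, the function names and the toy English corpus are all assumptions made for illustration, and a single centroid over all seed terms stands in for the centroids of the hierarchical clusters, so the clustering step itself is omitted.

```python
import math
import random
from collections import defaultdict

DIM = 200      # index vector dimensionality (assumed value)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)

def index_vector(rng):
    """Sparse ternary index vector: mostly zeros, a few random +1/-1 entries."""
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def context_vectors(sentences, window=1, seed=0):
    """Random indexing: a term's context vector is the sum of the index
    vectors of all tokens within `window` positions on either side."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(rng)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx = context[tok]
                    for d, value in enumerate(index[tokens[j]]):
                        ctx[d] += value
    return dict(context)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_candidates(vectors, seed_terms, top_k):
    """Rank terms that are not seeds by cosine similarity to the seed centroid.
    Simplification: one centroid for all seeds, no hierarchical clustering."""
    known = set(seed_terms)
    center = centroid([vectors[t] for t in seed_terms if t in vectors])
    scored = [(cosine(vec, center), term)
              for term, vec in vectors.items() if term not in known]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]

# Toy corpus (hypothetical); the 1+1 window corresponds to window=1.
sentences = [
    ["took", "aspirin", "today"],
    ["took", "ibuprofen", "today"],
    ["took", "paracetamol", "today"],
    ["my", "knee", "hurts"],
    ["my", "elbow", "hurts"],
]
vectors = context_vectors(sentences, window=1)
print(rank_candidates(vectors, ["aspirin", "ibuprofen"], top_k=1))
```

In this toy corpus the unseen term that occurs in exactly the same contexts as the two seed drugs receives the same context vector as they do, so it ranks first. A real setting would tokenise the Japanese text first and build one centroid per cluster of seed terms.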
Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a candidate list of length 10n, the choice of cluster size affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of the top n candidates for Body Part, however, clusters of at most two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25% for the top n candidates and a recall of 68% for the top 10n. For the top 10n candidates, the second-best results were obtained for Medical Finding: a recall of 58%, compared to 46% for Body Part. Taking only the top n candidates into account, however, resulted in a recall of 23% for Body Part, compared to 16% for Medical Finding.
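The recall figures above can be read as follows, under the assumption that n is the number of held-out vocabulary terms and that recall is the fraction of those terms found among the first n (or 10n) ranked candidates. The function name and the toy data are hypothetical.

```python
def recall_at(candidates, held_out, multiple=1):
    """Fraction of held-out terms found in the top (multiple * n) candidates,
    where n is the number of held-out terms (assumed reading of 'top n')."""
    held = set(held_out)
    if not held:
        raise ValueError("held_out must contain at least one term")
    cutoff = multiple * len(held)
    found = held.intersection(candidates[:cutoff])
    return len(found) / len(held)

# Toy ranked candidate list and held-out terms (hypothetical data).
ranked = ["a", "x", "y", "b", "z"]
held_out = ["a", "b"]
print(recall_at(ranked, held_out, multiple=1))   # top n = first 2 candidates
print(recall_at(ranked, held_out, multiple=10))  # top 10n = first 20 candidates
```

With these toy inputs, only one of the two held-out terms falls within the first two candidates, while both fall within the longer list, which mirrors the gap between the top n and top 10n figures reported above.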
Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.