Henriksson Aron, Moen Hans, Skeppstedt Maria, Daudaravičius Vidas, Duneld Martin
Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, SE-164 40 Kista, Sweden.
J Biomed Semantics. 2014 Feb 5;5(1):6. doi: 10.1186/2041-1480-5-6.
Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.
A combination of two distributional models - Random Indexing and Random Permutation - applied to a single corpus outperforms either model used in isolation. Furthermore, combining semantic spaces induced from different types of corpora - a corpus of clinical text and a corpus of medical journal articles - improves results further, outperforming both a combination of semantic spaces induced from a single source and a single semantic space induced from the conjoint corpus. Of the combination strategies explored, simply summing the cosine similarity scores of candidate terms is generally the most effective. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not on the task of extracting synonyms. The best results for the three tasks, measured as recall in a list of ten candidate terms, are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
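The summing strategy and the recall measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cosine`, `rank_candidates`, `recall_at_k`), the toy two-dimensional vectors, and the choice to let a term that is missing from one space simply receive no score from that space are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(query, spaces, k=10):
    """Sum each candidate's cosine similarity to the query across all
    semantic spaces, then return the k highest-scoring candidate terms."""
    scores = {}
    for space in spaces:
        if query not in space:
            continue  # assumption: a space without the query contributes nothing
        for term, vector in space.items():
            if term == query:
                continue
            scores[term] = scores.get(term, 0.0) + cosine(space[query], vector)
    # Sort by descending summed score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

def recall_at_k(ranked, gold):
    """Fraction of the reference terms that appear in the ranked list."""
    if not gold:
        return 0.0
    return len(set(ranked) & set(gold)) / len(gold)

# Hypothetical toy spaces: each maps a term to its context vector.
clinical_space = {
    "mi": [1.0, 0.0],
    "myocardial infarction": [0.9, 0.1],
    "aspirin": [0.0, 1.0],
}
journal_space = {
    "mi": [0.0, 1.0],
    "myocardial infarction": [0.1, 0.9],
    "aspirin": [1.0, 0.0],
}

top = rank_candidates("mi", [clinical_space, journal_space], k=10)
print(top)
print(recall_at_k(top, {"myocardial infarction"}))
```

In this toy setting each space is a plain dictionary. In practice the vectors would come from models such as Random Indexing or Random Permutation trained on the respective corpora, and the candidate list would be cut off at ten terms before recall is computed.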
This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models - with different model parameters - and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.