Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37203, USA.
J Biomed Inform. 2012 Dec;45(6):1075-83. doi: 10.1016/j.jbi.2012.06.003. Epub 2012 Jun 25.
Abbreviations are widely used in clinical documents, and they are often ambiguous. Building a list of possible senses (a sense inventory) for each ambiguous abbreviation is the first step toward automatically identifying the correct meaning of an abbreviation in a given context. Clustering-based methods have been used to detect senses of abbreviations from a clinical corpus [1]. However, rare senses remain challenging, and existing algorithms do not detect them reliably. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation for abbreviations in clinical text. Using manually annotated sense inventories for a set of 13 ambiguous clinical abbreviations, we evaluated TCRS against the existing Expectation Maximization (EM) clustering algorithm for sense generation at two levels of annotation cost (10 vs. 20 instances for each abbreviation). With similar annotation effort (about 20 instances), the TCRS-based method detected 85% of senses on average, whereas the EM-based method found only 75%. Further analysis showed that the improvement from TCRS came mainly from additionally detected rare senses, indicating its usefulness for building more complete sense inventories of clinical abbreviations.
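To make the evaluation measure concrete, the following is a minimal sketch of how the fraction of detected senses could be computed for one abbreviation under a fixed annotation budget. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how instances are chosen for annotation, so the round-robin selection across clusters, the function name `sense_coverage`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def sense_coverage(cluster_ids, gold_senses, budget):
    """Fraction of gold senses found when annotating at most `budget` instances.

    Assumption (not from the paper): instances are picked round-robin,
    one per cluster per pass, in order of cluster id.
    """
    # Group instance indices by the cluster they were assigned to.
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)
    queues = [members for _, members in sorted(clusters.items())]

    # Round-robin over clusters until the budget is spent or data runs out.
    picked = []
    depth = 0
    while len(picked) < budget and any(depth < len(q) for q in queues):
        for q in queues:
            if depth < len(q) and len(picked) < budget:
                picked.append(q[depth])
        depth += 1

    # A sense counts as detected if any annotated instance carries it.
    found = {gold_senses[i] for i in picked}
    return len(found) / len(set(gold_senses))


# Toy data: one dominant sense ("a") and two rarer ones ("b", "c").
gold = ["a", "a", "a", "a", "b", "b", "c"]
tight = [0, 0, 0, 0, 1, 1, 2]   # clustering that isolates the rare senses
merged = [0, 0, 0, 0, 0, 0, 0]  # clustering that lumps everything together

coverage_tight = sense_coverage(tight, gold, budget=3)
coverage_merged = sense_coverage(merged, gold, budget=3)
```

In this toy case the clustering that separates rare senses into their own clusters reaches full coverage with three annotations, while the merged clustering only surfaces the dominant sense, which is the kind of gap the comparison at a fixed annotation cost is meant to expose.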