Xu Hua, Stetson Peter D, Friedman Carol
Department of Biomedical Informatics, Columbia University, New York, NY, USA.
J Am Med Inform Assoc. 2009 Jan-Feb;16(1):103-8. doi: 10.1197/jamia.M2927. Epub 2008 Oct 24.
To develop methods for building corpus-specific sense inventories of abbreviations occurring in clinical documents.
A corpus of internal medicine admission notes was collected, and the instances of each clinical abbreviation in the corpus were grouped into sense clusters. One instance from each cluster was manually annotated to generate a final list of senses. Two clustering-based methods, Expectation Maximization (EM) and Farthest First (FF), and one random sampling method for sense detection were evaluated using a set of 12 clinical abbreviations.
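The selection step described above can be sketched as a farthest-first traversal over context vectors. This is a minimal illustration, not the authors' implementation: the bag-of-words features, the cosine distance, the function names, and the four toy contexts for the abbreviation "pt" are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def farthest_first(vectors, k):
    """Pick up to k center indices; each new center is the instance
    farthest from all centers chosen so far (starting at index 0)."""
    if not vectors:
        return []
    centers = [0]
    dist = [cosine_distance(v, vectors[0]) for v in vectors]
    while len(centers) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=lambda i: dist[i])
        if dist[nxt] <= 1e-12:  # every remaining instance duplicates a center
            break
        centers.append(nxt)
        for i, v in enumerate(vectors):
            dist[i] = min(dist[i], cosine_distance(v, vectors[nxt]))
    return centers


def assign_clusters(vectors, centers):
    """Assign each instance to its nearest center."""
    return [min(centers, key=lambda c: cosine_distance(v, vectors[c]))
            for v in vectors]


# Hypothetical contexts for "pt" (illustrative data, not from the study).
contexts = [
    ["pt", "is", "a", "male", "admitted"],
    ["pt", "was", "admitted", "male"],
    ["pt", "consult", "for", "gait", "training"],
    ["gait", "training", "with", "pt", "consult"],
]
vectors = [Counter(tokens) for tokens in contexts]
centers = farthest_first(vectors, k=2)
labels = assign_clusters(vectors, centers)
```

Under these assumptions, an annotator would label only the instances in `centers`, one per cluster, rather than every occurrence of the abbreviation.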
The clustering-based sense detection methods were evaluated using a set of clinical abbreviations that had been manually sense-annotated. "Sense Completeness" and "Annotation Cost" were used to measure the performance of the different methods. Clustering error rates were also reported for the different clustering algorithms.
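The abstract does not define the two measures, so the following sketch uses one plausible reading: sense completeness as the fraction of gold-standard senses recovered, and annotation cost as the number of instances a human had to annotate. Both definitions, the function names, and the example senses are assumptions for illustration.

```python
def sense_completeness(found_senses, gold_senses):
    """Fraction of gold-standard senses that the method recovered (assumed definition)."""
    gold = set(gold_senses)
    if not gold:
        raise ValueError("gold_senses must not be empty")
    return len(gold & set(found_senses)) / len(gold)


def annotation_cost(annotated_instances):
    """Number of instances manually annotated (assumed definition)."""
    return len(annotated_instances)


# Hypothetical example: three gold senses, two found by annotating two instances.
gold = ["patient", "physical therapy", "prothrombin time"]
found = ["patient", "physical therapy"]
completeness = sense_completeness(found, gold)
cost = annotation_cost([0, 2])
```

With these definitions, a method is preferable when it reaches higher completeness at the same or lower cost.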
A clustering-based semi-automated method was developed to build corpus-specific sense inventories for abbreviations in hospital admission notes. Evaluation demonstrated that this method could substantially reduce manual annotation cost and increase the completeness of sense inventories when compared with a manual annotation method using random samples.
The authors developed an effective clustering-based method for building corpus-specific sense inventories for abbreviations in a clinical corpus. To the best of the authors' knowledge, this is the first time clustering technologies have been used to help build sense inventories of abbreviations in clinical text. The results demonstrated that the clustering-based method performed better than the manual annotation method using random samples for the task of building sense inventories of clinical abbreviations.