Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan.
Bioinformatics. 2010 May 1;26(9):1246-53. doi: 10.1093/bioinformatics/btq129. Epub 2010 Mar 25.
The ultimate goal of abbreviation management is to disambiguate every occurrence of an abbreviation into its expanded form (concept or sense). To collect expanded forms for abbreviations, previous studies have recognized abbreviations and their expanded forms in parenthetical expressions in biomedical texts. However, the expanded forms extracted by abbreviation recognition are a mixture of concepts/senses and their term variations. Consequently, a list of expanded forms should be structured into a sense inventory, which provides the possible concepts or senses for abbreviation disambiguation.
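To illustrate what recognizing abbreviations in parenthetical expressions can look like, the following is a minimal sketch, not the recognizer used in this work. It assumes a simple heuristic: the parenthesized token is a short abbreviation, and each of its letters matches the first letter of one of the words immediately before the parenthesis. The names `PAREN` and `candidate_pairs` are hypothetical.

```python
import re

# Assumed pattern: a short token starting with a capital letter inside parentheses.
PAREN = re.compile(r"\(([A-Z][A-Za-z0-9-]{1,9})\)")


def candidate_pairs(text):
    """Return (abbreviation, expanded form) candidates found in text."""
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        # Letters and digits of the abbreviation, lowercased; hyphens are ignored.
        letters = [c.lower() for c in abbr if c.isalnum()]
        # Take as many preceding words as the abbreviation has letters.
        words = text[:match.start()].split()
        window = words[-len(letters):]
        # Accept only if every word starts with the corresponding letter.
        if len(window) == len(letters) and all(
            w[0].lower() == c for w, c in zip(window, letters)
        ):
            pairs.append((abbr, " ".join(window)))
    return pairs


if __name__ == "__main__":
    sentence = "Patients with acute myeloid leukemia (AML) were treated."
    print(candidate_pairs(sentence))
```

A first-letter heuristic like this misses many real expanded forms (for example, those with function words or letters taken from inside a word), which is one reason the extracted lists are noisy and need further structuring.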
A sense inventory is key to robust management of abbreviations. We therefore present a supervised approach for clustering expanded forms. In our experiments, the approach achieves an F1 score of 0.915 in clustering expanded forms. We then investigate possible conflicts between protein and gene names and abbreviations. Finally, an abbreviation disambiguation experiment on the sense inventory yielded an accuracy of 0.984 and an F1 score of 0.986 on a dataset obtained from MEDLINE abstracts.
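The idea of grouping term variations into senses can be sketched as follows. This is only an illustration: it swaps the supervised approach described here for a greedy, unsupervised grouping based on character-level similarity from Python's `difflib`, and the `normalize` helper, the `cluster_variants` function, and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def cluster_variants(terms, threshold=0.85):
    """Greedily group expanded forms whose normalized strings are similar.

    A term joins the first existing cluster that contains a member with a
    similarity ratio of at least `threshold`; otherwise it starts a new one.
    """
    clusters = []
    for term in terms:
        for cluster in clusters:
            if any(
                SequenceMatcher(None, normalize(term), normalize(member)).ratio()
                >= threshold
                for member in cluster
            ):
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


if __name__ == "__main__":
    forms = [
        "androgen receptor",
        "androgen receptors",
        "Androgen-receptor",
        "aortic regurgitation",
    ]
    for cluster in cluster_variants(forms):
        print(cluster)
```

String similarity alone cannot merge synonyms that look different or separate distinct senses that look alike, which is where a supervised similarity measure would be expected to help.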
The sense inventory and the abbreviation disambiguator are available at http://www.nactem.ac.uk/software/acromine/ and http://www.nactem.ac.uk/software/acromine_disambiguation/, respectively.