Tsuruoka Yoshimasa, McNaught John, Ananiadou Sophia
School of Computer Science, The University of Manchester, MIB, 131 Princess Street, Manchester, M1 7DN, UK.
BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S2. doi: 10.1186/1471-2105-9-S3-S2.
One of the difficulties in mapping biomedical named entities (e.g. genes, proteins, chemicals and diseases) to their concept identifiers stems from the variability of the terms. Soft string matching is a possible solution, but its heavy computational cost discourages its use when the dictionaries are large or when real-time processing is required. A less computationally demanding approach is to normalize the terms using heuristic rules, which allows a dictionary lookup in constant time regardless of dictionary size. Developing good heuristic rules, however, requires extensive knowledge of the terminology in question, and this is the bottleneck of the normalization approach.
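To make the constant-time lookup concrete, here is a minimal sketch of normalization-based dictionary matching. The three rules in `normalize` (lowercasing, treating hyphens, underscores and slashes as spaces, removing whitespace) and the concept identifiers `C1` and `C2` are illustrative assumptions, not rules or data from this work.

```python
import re


def normalize(term: str) -> str:
    """Apply a few hypothetical heuristic rules to a term."""
    term = term.lower()                  # ignore case differences
    term = re.sub(r"[-_/]", " ", term)   # treat hyphen-like characters as spaces
    term = re.sub(r"\s+", "", term)      # drop all whitespace
    return term


def build_index(entries):
    """Map each normalized term to the set of concept IDs it denotes."""
    index = {}
    for term, concept_id in entries:
        index.setdefault(normalize(term), set()).add(concept_id)
    return index


def lookup(index, term):
    """Hash lookup: average cost does not grow with the dictionary size."""
    return index.get(normalize(term), set())


# Toy dictionary with made-up concept identifiers.
entries = [("NF-kappa B", "C1"), ("Interleukin 2", "C2")]
index = build_index(entries)

print(lookup(index, "nf kappa-b"))     # {'C1'}
print(lookup(index, "INTERLEUKIN-2"))  # {'C2'}
```

The dictionary is normalized once when the index is built; each query then costs one normalization plus one hash lookup, in contrast to soft string matching, which compares the query against many entries.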
We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS.
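The abstract does not spell out how ambiguity and variability are measured or how the rule list is searched, so the following is only a sketch under stated assumptions. Here ambiguity is taken to be the number of extra concepts sharing a normalized string, variability the number of extra normalized strings per concept, rules are plain substring replacements, and the search is greedy. The names `apply_rules`, `ambiguity_and_variability` and `greedy_discover` are hypothetical.

```python
from collections import defaultdict


def apply_rules(term, rules):
    """Apply (old, new) substring replacements in order (assumed rule form)."""
    for old, new in rules:
        term = term.replace(old, new)
    return term


def ambiguity_and_variability(entries, rules):
    """Assumed measures: extra concepts per string, extra strings per concept."""
    concepts_by_string = defaultdict(set)
    strings_by_concept = defaultdict(set)
    for term, concept_id in entries:
        norm = apply_rules(term, rules)
        concepts_by_string[norm].add(concept_id)
        strings_by_concept[concept_id].add(norm)
    ambiguity = sum(len(c) - 1 for c in concepts_by_string.values())
    variability = sum(len(s) - 1 for s in strings_by_concept.values())
    return ambiguity, variability


def greedy_discover(entries, candidates, max_rules=10):
    """Greedily add the rule that most lowers ambiguity + variability."""
    rules = []
    best = sum(ambiguity_and_variability(entries, rules))
    remaining = list(candidates)
    while remaining and len(rules) < max_rules:
        scored = [(sum(ambiguity_and_variability(entries, rules + [r])), r)
                  for r in remaining]
        score, rule = min(scored, key=lambda pair: pair[0])
        if score >= best:  # no candidate improves the objective
            break
        rules.append(rule)
        remaining.remove(rule)
        best = score
    return rules


# Toy dictionary with made-up concept identifiers.
entries = [("IL-2", "C1"), ("IL 2", "C1"), ("IL2", "C1"),
           ("IL-12", "C2"), ("IL12", "C2"), ("IL1", "C3")]
candidates = [("-", ""), (" ", ""), ("1", "")]

print(greedy_discover(entries, candidates))  # [('-', ''), (' ', '')]
```

On this toy data, removing hyphens and spaces merges the variants of each concept, while removing the digit `1` is rejected because it would collapse distinct concepts into the same string, which raises ambiguity.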
The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and that the computational overhead of rule application is small enough to allow a very fast implementation. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction, especially when good normalization heuristics for the target terminology are not fully known.