Tsai Richard Tzong-Han, Sung Cheng-Lung, Dai Hong-Jie, Hung Hsieh-Chuan, Sung Ting-Yi, Hsu Wen-Lian
Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, Republic of China.
BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S11. doi: 10.1186/1471-2105-7-S5-S11.
Biomedical named entity recognition (Bio-NER) is a challenging problem because biomedical named entities of the same category (e.g., proteins and genes) generally do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). Such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but including all conjunction features in an NER model is infeasible because memory resources are limited and some features are ineffective. To resolve this problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. To address this, we apply numerical normalization, which replaces all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors; it may also depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. We call this step pattern-based post-processing.
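As an illustration of the numerical normalization described above, the following minimal sketch maps terms that differ only in their numerals to one shared form. The function name `normalize_numerals`, the choice of "1" as the representative numeral, and the decision to collapse each run of digits (rather than each single digit) are assumptions made for this example; the abstract does not specify them.

```python
import re


def normalize_numerals(token: str, representative: str = "1") -> str:
    """Replace every run of digits in a token with one representative numeral.

    Assumption: a whole digit run ("12") collapses to a single numeral ("1").
    """
    return re.sub(r"\d+", representative, token)


# Hypothetical tokens; terms that differ only in their numerals share one form.
tokens = ["IL2", "IL-12", "p53", "CD28", "kinase"]
normalized = [normalize_numerals(t) for t in tokens]
print(normalized)
```

With this kind of mapping, a feature learned from one numbered variant of a term also fires for variants unseen in training, which is the sparseness reduction the text refers to.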
Our ML-based Bio-NER system uses conditional random fields (CRFs), which have performed well on several well-known tasks, as its underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-score by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total F-score of 72.98%, which exceeds that of the baseline system that uses only singleton features.
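The selection of conjunction feature groups can be pictured as a greedy loop. The sketch below is a generic sequential forward search, not the authors' implementation: the group names, the `min_gain` stopping rule, and the toy scoring function are all hypothetical. In a real system, `evaluate` would presumably train a CRF with the given feature groups and return its F-score on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, List


def sequential_forward_search(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedily add the candidate that most improves the score, until none helps."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = evaluate(frozenset(selected))
    while remaining:
        # Score every remaining group when added to the current selection.
        scored = [(evaluate(frozenset(selected + [c])), c) for c in remaining]
        top_score, top_group = max(scored, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break  # no remaining group improves the score enough
        selected.append(top_group)
        remaining.remove(top_group)
        best_score = top_score
    return selected


# Toy stand-in for "train a model and measure its F-score" (invented values).
TOY_GAIN = {"word+pos": 0.9, "word+prev_word": 0.5, "shape+suffix": 0.3, "pos+length": -0.2}


def toy_evaluate(groups: FrozenSet[str]) -> float:
    return sum(TOY_GAIN[g] for g in groups)


chosen = sequential_forward_search(TOY_GAIN, toy_evaluate)
print(chosen)
```

The greedy search evaluates on the order of n² candidate sets for n groups instead of all 2ⁿ subsets, which is why it stays practical when exhaustive inclusion of conjunction features is not.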
We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
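To make the role of local alignment concrete, here is a minimal sketch of the Smith-Waterman algorithm applied to two tokenized sentences. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the `PROTEIN` placeholder token, and the example sentences are assumptions for illustration. How aligned regions are turned into global patterns and used to correct tagger output is not specified in the abstract and is not shown here.

```python
from typing import List, Tuple


def smith_waterman(
    a: List[str], b: List[str], match: int = 2, mismatch: int = -1, gap: int = -1
) -> Tuple[int, List[Tuple[str, str]]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    aligned: List[Tuple[str, str]] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned.append((a[i - 1], "-"))  # gap in b
            i -= 1
        else:
            aligned.append(("-", b[j - 1]))  # gap in a
            j -= 1
    aligned.reverse()
    return best, aligned


# Hypothetical sentences in which entity mentions are replaced by a placeholder.
s1 = ["binding", "of", "PROTEIN", "to", "the", "receptor"]
s2 = ["the", "binding", "of", "PROTEIN", "to", "its", "receptor"]
score, aligned = smith_waterman(s1, s2)
print(score, aligned)
```

Because the alignment is local, the shared region can span more tokens than a fixed five-word context window, which is the property that lets such patterns capture longer-range structure.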