NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition.

Authors

Tsai Richard Tzong-Han, Sung Cheng-Lung, Dai Hong-Jie, Hung Hsieh-Chuan, Sung Ting-Yi, Hsu Wen-Lian

Affiliation

Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, Republic of China.

Publication information

BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S11. doi: 10.1186/1471-2105-7-S5-S11.

DOI: 10.1186/1471-2105-7-S5-S11

PMID: 17254295

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1764467/

Abstract

BACKGROUND

Biomedical named entity recognition (Bio-NER) is a challenging problem because biomedical named entities of the same category (e.g., proteins and genes) generally do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve the problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. To address this, we apply numerical normalization, which replaces all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (for example, a context window of five consists of the current word plus the two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing.
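The first two ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which representative numeral is used or how feature sets are scored, so the choice of "1", the collapsing of each digit run into a single numeral, the feature-string format, and the `score` callback are all assumptions made for illustration.

```python
import re
from typing import Callable, FrozenSet, List, Sequence


def normalize_numerals(term: str, representative: str = "1") -> str:
    """Collapse each run of digits into one representative numeral (assumed "1")."""
    return re.sub(r"\d+", representative, term)


def word_at(tokens: Sequence[str], i: int, offset: int) -> str:
    """Singleton property: the normalized word at a relative offset, padded at the edges."""
    j = i + offset
    return normalize_numerals(tokens[j]) if 0 <= j < len(tokens) else "<PAD>"


def conjunction_feature(tokens: Sequence[str], i: int, offsets: Sequence[int]) -> str:
    """Conjunction feature: several word properties joined into one feature string."""
    return "&".join(f"w[{o}]={word_at(tokens, i, o)}" for o in offsets)


def sequential_forward_search(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedily add the candidate group that most improves `score`; stop when none helps.

    `score` is a hypothetical callback; in practice it might be the F-score
    of a tagger trained with the given feature groups on held-out data.
    """
    selected: List[str] = []
    best = score(frozenset())
    remaining = list(candidates)
    while remaining:
        top_score, top = max((score(frozenset(selected + [c])), c) for c in remaining)
        if top_score <= best:
            break  # no remaining group improves the score
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected


if __name__ == "__main__":
    tokens = ["IL2", "and", "IL12", "receptor"]
    # Under the assumptions above, IL2 and IL12 map to the same string.
    print(normalize_numerals("IL2"), normalize_numerals("IL12"))
    print(conjunction_feature(tokens, 1, (-1, 0)))

    # Toy, made-up utilities per feature group, only to exercise the search.
    toy_gain = {"w[-1]&w[0]": 1.0, "w[0]&w[1]": 0.5, "w[-2]&w[2]": -0.2}
    print(sequential_forward_search(list(toy_gain), lambda s: sum(toy_gain[g] for g in s)))
```

The greedy search evaluates far fewer feature sets than an exhaustive search over all subsets, which is the practical reason for selecting conjunction groups rather than including all of them.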

RESULTS

As the underlying ML model of our Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-score by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total F-score of 72.98%, which is better than that of the baseline system that uses only singleton features.
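The reported figures can be checked with a few lines of arithmetic. The sketch below assumes the three gains are absolute F-score points that add up cumulatively; the baseline value it derives is implied by that assumption and is not stated in the abstract.

```python
# Reported F-score gains (taken from the abstract), treated as absolute points.
gains = {
    "selected conjunction features": 1.67,
    "numerical normalization": 1.04,
    "pattern-based post-processing": 0.57,
}

combined_gain = round(sum(gains.values()), 2)             # reported as 3.28
final_f_score = 72.98                                     # reported total
implied_baseline = round(final_f_score - combined_gain, 2)  # derived, not reported

print(f"combined gain:    {combined_gain:.2f}")
print(f"implied baseline: {implied_baseline:.2f}")
```

The three individual gains sum exactly to the reported combined increase, which is consistent with reading them as additive points rather than relative percentages.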

CONCLUSION

We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
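To make the alignment step concrete, here is a minimal sketch of Smith-Waterman local alignment applied to token sequences rather than to molecular sequences. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the example sentences are assumptions, and the abstract does not describe how aligned segments are turned into global patterns, so that step is left out.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[Pair]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    aligned: List[Pair] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned.append((a[i - 1], None))  # gap in b
            i -= 1
        else:
            aligned.append((None, b[j - 1]))  # gap in a
            j -= 1
    aligned.reverse()
    return best, aligned


if __name__ == "__main__":
    s1 = ["the", "IL1", "gene", "promoter"]
    s2 = ["human", "IL1", "gene", "expression"]
    score, pairs = smith_waterman(s1, s2)
    print(score, pairs)
```

Because the score is clamped at zero, the alignment picks out the best-matching shared segment of two sentences and ignores the unrelated text around it, which is what lets it capture structures that extend beyond a fixed context window.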
