NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition.

Authors

Tsai Richard Tzong-Han, Sung Cheng-Lung, Dai Hong-Jie, Hung Hsieh-Chuan, Sung Ting-Yi, Hsu Wen-Lian

Affiliation

Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan, Republic of China.

Publication

BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S11. doi: 10.1186/1471-2105-7-S5-S11.


DOI:10.1186/1471-2105-7-S5-S11
PMID:17254295
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1764467/
Abstract

BACKGROUND: Biomedical named entity recognition (Bio-NER) is challenging because biomedical named entities of the same category (e.g., proteins and genes) generally do not follow a single standard nomenclature. They are highly irregular and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the state of the art in Bio-NER. This paper addresses three problems faced by ML-based Bio-NER systems.

First, most ML approaches employ singleton features, each comprising one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). Such features may be insufficient when multiple properties must be considered together. Adding conjunction features that combine multiple properties can help, but including all of them in an NER model is infeasible because memory is limited and some features are ineffective. To resolve this, we use a sequential forward search algorithm to select an effective set of features.

Second, variations in the numerical parts of biomedical terms (e.g., the "2" in IL2) cause data sparseness and generate many redundant features. We apply numerical normalization, which replaces all numerals in a term with one representative numeral to help classify named entities.

Third, the assignment of NE tags does not depend solely on the target word's closest neighbors; it may also depend on words outside the context window (e.g., a context window of five consists of the current word plus the two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the output of our ML-based tagger. We call this pattern-based post-processing.
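The numerical normalization step can be illustrated with a minimal Python sketch. The abstract does not say which representative numeral is used or whether digits are replaced one by one or as a run, so the choices here (each maximal run of digits becomes "1") and the function name `normalize_numerals` are assumptions made for illustration only.

```python
import re

# Assumption: a "numeral" is a maximal run of digits, and the
# representative numeral is "1". The abstract specifies neither.
_DIGIT_RUN = re.compile(r"\d+")


def normalize_numerals(token: str, representative: str = "1") -> str:
    """Replace every run of digits in a token with one representative numeral."""
    return _DIGIT_RUN.sub(representative, token)


if __name__ == "__main__":
    tokens = ["IL2", "IL12", "CD28", "p53", "kinase"]
    # Terms that differ only in their numeric part collapse to one form,
    # so a feature built from the token string is shared between them.
    print([normalize_numerals(t) for t in tokens])
    # prints ['IL1', 'IL1', 'CD1', 'p1', 'kinase']
```

Under these assumptions, IL2 and IL12 produce the same word feature, which is how normalization reduces redundant and unseen features.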
RESULTS: We use conditional random fields, which have performed well in several well-known tasks, as the underlying ML model of our Bio-NER system. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-score by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% over the baseline system, which uses only singleton features, yields a total F-score of 72.98%.

CONCLUSION: We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization effectively reduces the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm helps ML-based Bio-NER handle difficult cases that need longer context windows.
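For readers unfamiliar with Smith-Waterman local alignment, the following Python sketch aligns two token sequences and returns the best-scoring local alignment. It shows only the alignment step. The abstract does not describe how alignments are turned into global patterns or applied to the tagger output, so that part is not attempted here. The scoring values (match 2, mismatch -1, gap -1) and the function name `smith_waterman` are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_pairs) for the best local alignment of a and b.

    a and b are sequences of tokens. aligned_pairs is a list of (x, y) tuples,
    where None marks a gap. Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    aligned = []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, b[j - 1]))
            j -= 1
    aligned.reverse()
    return best, aligned


if __name__ == "__main__":
    s1 = "the expression of IL2 gene was induced".split()
    s2 = "expression of CD28 gene in T cells".split()
    best, pairs = smith_waterman(s1, s2)
    print(best)
    for x, y in pairs:
        print(x, "<->", y)
```

In this made-up example the two sentences share the structure "expression of X gene", with the differing tokens aligned against each other. A shared span of this kind is the sort of structure that can reach beyond a five-word context window.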
