Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing.

Affiliation

Center for Systems Biology, Soochow University, Suzhou, Jiangsu, China.

Publication information

PLoS One. 2012;7(6):e39230. doi: 10.1371/journal.pone.0039230. Epub 2012 Jun 26.

DOI:10.1371/journal.pone.0039230

PMID:22745720

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3383748/

Abstract

Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area, but their recognition performance can still be improved. Our approach combines support vector machines (SVMs) and conditional random fields (CRFs), which complement each other. In the hybrid process, we first use an SVM to separate biological terms from non-biological terms, and then use CRFs to determine the types of the biological terms. This makes full use of the strength of the SVM as a binary classifier and of the sequence-labeling capacity of CRFs. We then merge the results of the SVM and the CRFs. To remove inconsistencies that might result from the merging, we develop an algorithm and apply two rules. To ensure that biological terms of maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. In addition, our approach gradually extends the context so that more contextual information is included. We examined the performance of four approaches on the GENIA corpus and the JNLPBA04 data. Combining SVM and CRFs improved performance: the macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% on the GENIA corpus and 84.04% on the JNLPBA04 data.
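The abstract reports macro-precision, macro-recall, and macro-F1 but does not spell out how they are computed. The following is a minimal sketch of one common convention, not the authors' evaluation code. It assumes token-level labels, treats "O" as the non-entity label, and takes macro-F1 as the unweighted mean of per-class F1 scores; the paper may instead score whole entities or derive macro-F1 from macro-precision and macro-recall. The function name `macro_scores` and the toy label sequences are hypothetical.

```python
from collections import Counter


def macro_scores(gold, pred, ignore=("O",)):
    """Return (macro_precision, macro_recall, macro_f1) over entity classes.

    Assumption: token-level scoring; labels listed in `ignore` (here the
    non-entity label "O") are not counted as classes.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp[g] += 1          # correct entity label
        else:
            if p not in ignore:
                fp[p] += 1          # predicted class p where it does not belong
            if g not in ignore:
                fn[g] += 1          # missed a token of class g

    classes = sorted(set(tp) | set(fp) | set(fn))
    if not classes:
        return 0.0, 0.0, 0.0

    precisions, recalls, f1s = [], [], []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(classes)
    # Macro averaging: every class contributes equally, regardless of frequency.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with made-up labels (not data from the paper).
gold = ["protein", "protein", "O", "DNA", "O", "cell_type"]
pred = ["protein", "O", "O", "DNA", "DNA", "cell_type"]
macro_p, macro_r, macro_f1 = macro_scores(gold, pred)
print(f"macro-P={macro_p:.4f} macro-R={macro_r:.4f} macro-F1={macro_f1:.4f}")
```

Because each class is weighted equally, rare entity types influence the macro scores as much as frequent ones, which is why macro averages are a natural fit for a method that explicitly tries to avoid bias against rare events.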

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/208b/3383748/49ca0cb5a11c/pone.0039230.g001.jpg
