A probabilistic model for identifying protein names and their name boundaries.

Author information

Seki Kazuhiro, Mostafa Javed

Affiliation

Laboratory of Applied Informatics Research, Indiana University, Bloomington, 47405-3907, USA.

Publication information

Proc IEEE Comput Soc Bioinform Conf. 2003;2:251-8.

PMID:16452800

Abstract

This paper proposes a method for identifying protein names in biomedical texts with an emphasis on detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our approach does not rely on natural language processing tools such as part-of-speech taggers and syntactic parsers, so as to reduce processing overhead and the potential number of probabilistic parameters to be estimated. A notion of certainty is also proposed to improve precision for identification. We implemented a protein name identification system based on our proposed method, and evaluated the system on real-world biomedical texts in conjunction with the previous work. The results showed that overall our system performs comparably to the state-of-the-art protein name identification system and that higher performance is achieved for compound names. In addition, it is demonstrated that our system can further improve precision by restricting the system output to those names with high certainties.
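
As a rough illustration of two ideas named in the abstract, the sketch below maps tokens to coarse word classes based on surface form and then keeps only candidate names whose certainty meets a threshold. The class labels, the `GREEK` set, the function names, and the example certainty values are all hypothetical and chosen for illustration; the abstract does not specify the authors' actual features, classes, or certainty formula.

```python
# Hypothetical surface-form word classes; not the paper's actual class inventory.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}


def word_class(token: str) -> str:
    """Map a token to a coarse class using only its surface form."""
    if token.lower() in GREEK:
        return "GREEK"
    if token.isdigit():
        return "NUM"
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if has_alpha and has_digit:
        return "ALNUM"   # mixed letters and digits, e.g. "p53"
    if token.isupper() and len(token) >= 2:
        return "UPPER"   # all-caps abbreviation, e.g. "NF"
    if token[:1].isupper():
        return "CAP"     # capitalized word
    if token.islower():
        return "LOWER"
    return "OTHER"


def filter_by_certainty(candidates, threshold):
    """Keep candidate names whose certainty score is at least `threshold`.

    `candidates` is a list of (name, certainty) pairs; how certainty is
    computed is left open here because the abstract does not define it.
    """
    return [name for name, certainty in candidates if certainty >= threshold]


# Made-up candidates and scores, for illustration only.
candidates = [("NF-kappa B", 0.92), ("binding", 0.31)]
high_certainty = filter_by_certainty(candidates, threshold=0.5)
```

Raising the threshold trades recall for precision, which matches the abstract's claim that restricting output to high-certainty names improves precision.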
