Detection of IUPAC and IUPAC-like chemical names.

Author information

Klinger Roman, Kolárik Corinna, Fluck Juliane, Hofmann-Apitius Martin, Friedrich Christoph M

Affiliation

Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Schloss Birlinghoven, 53574 Sankt Augustin, Germany.

Publication information

Bioinformatics. 2008 Jul 1;24(13):i268-76. doi: 10.1093/bioinformatics/btn181.

DOI:10.1093/bioinformatics/btn181

PMID:18586724

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2718657/

Abstract

MOTIVATION

Chemical compounds such as small signal molecules or other biologically active substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, including SMILES, InChI, IUPAC and trivial names. Only SMILES and InChI allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and thereby mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, together with its evaluation and the conversion rate achieved with available name-to-structure tools.
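To make the sequence-labeling framing concrete, the following minimal sketch shows how an annotated name could be turned into per-token labels of the kind a CRF tagger is trained on. The BIO label scheme, the regex tokenizer, the character-offset annotation format and the example sentence are all assumptions for illustration; the abstract does not specify the tokenization or label set actually used.

```python
import re

def tokenize(text):
    """Split text into word-like runs and single punctuation marks,
    keeping character offsets (a simple, assumed tokenizer)."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def bio_labels(text, spans):
    """Assign B-IUPAC / I-IUPAC / O labels to tokens.

    spans: list of (start, end) character offsets of annotated names
    (a hypothetical annotation format).
    """
    labeled = []
    for token, start, end in tokenize(text):
        label = "O"
        for span_start, span_end in spans:
            if start >= span_start and end <= span_end:
                # The first token of a mention opens it; the rest continue it.
                label = "B-IUPAC" if start == span_start else "I-IUPAC"
                break
        labeled.append((token, label))
    return labeled

# Illustrative sentence containing a systematic name.
text = "Treatment with 2-acetoxybenzoic acid was effective."
name = "2-acetoxybenzoic acid"
name_start = text.index(name)
tagged = bio_labels(text, [(name_start, name_start + len(name))])
for token, label in tagged:
    print(f"{token}\t{label}")
```

Because systematic names are built from an open-ended grammar of locants, prefixes and suffixes, a token-level tagger of this kind can generalize to names that no dictionary lists.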

RESULTS

We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows good performance on these cases (F1 measure 81.5%). The remaining recognition problem is detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.
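Two of the notions above can be sketched in a few lines. The first function shows the simplest form of offset conjunction, in which each token also receives its neighbors' features tagged with their relative position. The second computes an exact-match F1 over mention spans, under which a boundary error counts as both a false positive and a false negative. The feature names, the chosen offsets and the span format are illustrative assumptions, not the authors' feature set or evaluation script.

```python
PUNCTUATION = {"(", ")", "[", "]", "-", ","}

def token_features(token):
    """A few simple per-token features (hypothetical feature set)."""
    feats = {"lower=" + token.lower()}
    if any(ch.isdigit() for ch in token):
        feats.add("has_digit")
    if token in PUNCTUATION:
        feats.add("is_punct")
    return feats

def offset_conjunctions(tokens, offsets=(-1, 1)):
    """Add each neighbor's features, suffixed with its relative offset."""
    base = [token_features(tok) for tok in tokens]
    combined = []
    for i in range(len(tokens)):
        feats = set(base[i])
        for off in offsets:
            j = i + off
            if 0 <= j < len(tokens):
                feats |= {f"{feat}@{off}" for feat in base[j]}
        combined.append(feats)
    return combined

def f1_score(gold_spans, predicted_spans):
    """Exact-match F1 over (start, end) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

features = offset_conjunctions(["2", "-", "acetoxybenzoic", "acid"])
# One span is matched exactly; the other has a wrong right border.
score = f1_score({(2, 6), (10, 12)}, {(2, 6), (10, 11)})
print(sorted(features[0]), score)
```

Widening `offsets` gives the model more context around each token at the cost of a larger feature space, which is one reason such a parameter is worth evaluating.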

AVAILABILITY

We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component.
