Detection of IUPAC and IUPAC-like chemical names.

Author information

Klinger Roman, Kolárik Corinna, Fluck Juliane, Hofmann-Apitius Martin, Friedrich Christoph M

Affiliation

Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Schloss Birlinghoven, 53574 Sankt Augustin, Germany.

Publication information

Bioinformatics. 2008 Jul 1;24(13):i268-76. doi: 10.1093/bioinformatics/btn181.

Abstract

MOTIVATION

Chemical compounds such as small signal molecules and other biologically active substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, including SMILES, InChI, IUPAC names and trivial names. Only SMILES and InChI allow a direct structure search, but trivial names and IUPAC-like names are used more frequently in biomedical texts. Trivial names can be found with a dictionary-based approach and thereby mapped to their corresponding structures, but it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) that finds mentions of IUPAC and IUPAC-like names in scientific text. We also report its evaluation and the conversion rate achieved with available name-to-structure tools.
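To make the sequence-labeling idea concrete, the following is a minimal sketch of how per-token features might be prepared for a CRF tagger. It is not the authors' feature set: the tokenizer, the suffix list, the feature names and the `with_offsets` helper are all assumptions chosen for illustration, and the neighbor-copying step is only a simplified reading of "offset conjunctions".

```python
import re

# Hypothetical punctuation set and suffix pattern; not taken from the paper.
CHEM_PUNCT = set("()[]{},-'")
CHEM_SUFFIX = re.compile(r"(yl|ol|ane|ene|ine|ate|ide|oxy)$")


def tokenize(text):
    """Split text into letter runs, digit runs and single other characters."""
    return re.findall(r"[A-Za-z]+|\d+|\S", text)


def token_features(tok):
    """Return a set of simple surface features for one token."""
    feats = {"lower=" + tok.lower()}
    if tok.isdigit():
        feats.add("is_digit")
    if tok in CHEM_PUNCT:
        feats.add("is_chem_punct")
    if CHEM_SUFFIX.search(tok.lower()):
        feats.add("chem_suffix")
    return feats


def with_offsets(tokens, window=1):
    """Give each token its own features plus those of its neighbors.

    Each feature is tagged with the relative offset of the token it came
    from, so the tagger can see a limited amount of context.
    """
    per_token = [token_features(t) for t in tokens]
    out = []
    for i in range(len(tokens)):
        feats = set()
        for off in range(-window, window + 1):
            j = i + off
            if 0 <= j < len(tokens):
                feats |= {f"{f}@{off}" for f in per_token[j]}
        out.append(feats)
    return out


tokens = tokenize("2-methylpropan-1-ol")
features = with_offsets(tokens, window=1)
```

A CRF would then be trained on such feature sets paired with per-token labels (for example, inside or outside a chemical name). Widening `window` gives the model more context per token at the cost of a larger feature space.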

RESULTS

We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. An evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation on hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows good performance on these cases (F1 measure of 81.5%). The remaining recognition problems concern detecting the correct boundaries of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.
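For readers unfamiliar with the metric, the F1 measure is the harmonic mean of precision and recall. The sketch below computes it over entity spans, assuming exact boundary matching; the abstract does not state which matching criterion was used, and the function name and the example spans are made up for illustration.

```python
def f1_exact_match(gold, predicted):
    """Return (precision, recall, F1) for exact-match entity spans.

    Spans are (start, end) character offsets. Exact matching is an
    assumption here, not a criterion stated in the abstract.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the second prediction ends two characters early.
gold_spans = [(0, 19), (30, 42), (50, 61)]
predicted_spans = [(0, 19), (30, 40), (50, 61)]
precision, recall, f1 = f1_exact_match(gold_spans, predicted_spans)
```

Under exact matching, a single boundary error such as the truncated span above counts as both a false positive and a false negative, which is why boundary detection on long names weighs heavily on the score.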

AVAILABILITY

We plan to publish the corpora and annotation guidelines, as well as the conditional random field model as a UIMA component.
