Klinger Roman, Kolárik Corinna, Fluck Juliane, Hofmann-Apitius Martin, Friedrich Christoph M
Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Schloss Birlinghoven, 53574 Sankt Augustin, Germany.
Bioinformatics. 2008 Jul 1;24(13):i268-76. doi: 10.1093/bioinformatics/btn181.
Chemical compounds such as small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, including SMILES, InChI, IUPAC and trivial names. Only SMILES and InChI names allow a direct structure search, but trivial names and IUPAC-like names are used more frequently in biomedical texts. While trivial names can be found with a dictionary-based approach and thereby mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text. We also report its evaluation and the rate at which the recognized names can be converted with available name-to-structure tools.
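To make the tagging setup more concrete, the following is a minimal sketch of how per-token features for a CRF sequence labeler might be built, including a simplified form of offset conjunctions in which each token also receives the features of its neighbours, marked with their relative position. The feature names, the suffix list and the functions `token_features`, `add_offset_conjunctions` and `featurize` are illustrative assumptions; the abstract does not specify the actual feature set or implementation.

```python
import re

# Hypothetical suffix list, chosen only for illustration.
CHEM_SUFFIX = re.compile(r"(yl|ol|ane|ene|ine|ate|ide)$")


def token_features(token):
    """Return a set of simple surface features for one token."""
    feats = {"lower=" + token.lower()}
    if re.search(r"\d", token):
        feats.add("has_digit")
    if "-" in token:
        feats.add("has_hyphen")
    if CHEM_SUFFIX.search(token.lower()):
        feats.add("chem_suffix")
    return feats


def add_offset_conjunctions(per_token, max_offset=1):
    """Copy neighbour features into each position, tagged with the offset.

    This is a simplified reading of offset conjunctions: a feature f of the
    token at position i + k is added to position i as "f@k".
    """
    n = len(per_token)
    out = []
    for i in range(n):
        feats = set(per_token[i])
        for off in range(-max_offset, max_offset + 1):
            if off == 0:
                continue
            j = i + off
            if 0 <= j < n:
                feats.update(f"{f}@{off:+d}" for f in per_token[j])
        out.append(feats)
    return out


def featurize(tokens, max_offset=1):
    """Build the feature sets that a CRF trainer would consume."""
    return add_offset_conjunctions([token_features(t) for t in tokens], max_offset)


if __name__ == "__main__":
    for tok, feats in zip(["of", "2-methylpropan-1-ol", "was"],
                          featurize(["of", "2-methylpropan-1-ol", "was"])):
        print(tok, sorted(feats))
```

Raising `max_offset` widens the context window each token sees, which is one plausible way to interpret the offset conjunction order mentioned in the abstract.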
We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation on hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows good performance on these cases (F1 measure of 81.5%). The remaining recognition problems concern detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.
We plan to publish the corpora and the annotation guidelines, as well as the conditional random field model as a UIMA component.
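The sketch below illustrates why border detection matters for the reported F1 measure. It computes precision, recall and F1 over entity spans extracted from B/I/O tag sequences, assuming exact-match scoring; the abstract does not state the matching criterion, so this criterion, the single-class tag scheme and the function names `iob_spans` and `f1_exact` are assumptions made for illustration.

```python
def iob_spans(tags):
    """Extract (start, end) spans, end exclusive, from a B/I/O tag sequence."""
    spans, start = set(), None
    # A trailing "O" sentinel closes a span that runs to the end of the input.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag != "I" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            start = i  # tolerate an "I" that is not preceded by "B"
    return spans


def f1_exact(gold_tags, pred_tags):
    """Precision, recall and F1 where only exact span matches count."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["O", "B", "I", "I", "O", "B", "O"]
    pred = ["O", "B", "I", "O", "O", "B", "O"]  # first entity is cut one token short
    print(f1_exact(gold, pred))
```

Under exact matching, a prediction that misses a single boundary token counts as both a false positive and a false negative, so long names with uncertain borders are penalized twice.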