Improved chemical text mining of patents with infinite dictionaries and automatic spelling correction.

Affiliation

NextMove Software, Cambridge, United Kingdom.

Publication information

J Chem Inf Model. 2012 Jan 23;52(1):51-62. doi: 10.1021/ci200463r. Epub 2011 Dec 28.

Abstract

The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
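The abstract does not spell out how CaffeineFix works internally, so the following is only a toy sketch of the general idea suggested by the title: an "infinite dictionary" expressed as a finite-state automaton, combined with a search for the nearest accepted name within a small number of edits. The token lists (`PREFIXES`, `PARENTS`), the grammar "any number of prefixes followed by one parent", the function names, and the `max_edits` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import heapq

# Hypothetical toy vocabulary; a real nomenclature grammar is far larger.
PREFIXES = ["chloro", "bromo", "methyl", "ethyl"]
PARENTS = ["benzene", "propane", "butane"]


def build_automaton(prefixes, parents):
    """Build a character-level automaton accepting prefix* parent.

    Because prefixes loop back to the start state, the automaton accepts
    an unbounded set of names from a finite description.
    """
    transitions = [{}]  # transitions[state][char] -> next state; 0 is the start
    accepting = set()
    for word in prefixes + parents:
        state = 0
        for ch in word[:-1]:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        if word in prefixes:
            transitions[state][word[-1]] = 0  # another token may follow
        else:
            transitions.append({})
            transitions[state][word[-1]] = len(transitions) - 1
            accepting.add(len(transitions) - 1)
    return transitions, accepting


def correct(name, transitions, accepting, max_edits=2):
    """Return (closest accepted name, edit count), or None if none is close enough.

    Cheapest-first search over (edits, position in name, automaton state).
    """
    heap = [(0, 0, 0, "")]
    seen = set()
    while heap:
        cost, pos, state, out = heapq.heappop(heap)
        if pos == len(name) and state in accepting:
            return out, cost
        if (pos, state) in seen:
            continue
        seen.add((pos, state))
        if cost < max_edits and pos < len(name):
            # Drop a spurious input character (e.g. a stray hyphen or space).
            heapq.heappush(heap, (cost + 1, pos + 1, state, out))
        for ch, nxt in transitions[state].items():
            if pos < len(name):
                step = 0 if name[pos] == ch else 1  # exact match or substitution
                if cost + step <= max_edits:
                    heapq.heappush(heap, (cost + step, pos + 1, nxt, out + ch))
            if cost < max_edits:
                # Supply a character that is missing from the input.
                heapq.heappush(heap, (cost + 1, pos, nxt, out + ch))
    return None


TRANSITIONS, ACCEPTING = build_automaton(PREFIXES, PARENTS)

if __name__ == "__main__":
    # An OCR-style error: "e" misread as "c".
    print(correct("chlorobenzcne", TRANSITIONS, ACCEPTING))  # ('chlorobenzene', 1)
```

In this sketch the corrected string, not the raw text, would be what gets handed to a name-to-structure tool, which mirrors the preprocessing role described in the abstract. Bounding the search by `max_edits` keeps it finite even though the automaton contains a loop.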
