Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge, England.
J Chem Inf Model. 2011 Mar 28;51(3):739-53. doi: 10.1021/ci100384d. Epub 2011 Mar 9.
We have produced an open-source, freely available algorithm, the Open Parser for Systematic IUPAC Nomenclature (OPSIN), that interprets the majority of organic chemical nomenclature quickly and precisely. The approach is based on a regular grammar, which is used to guide tokenization, a potentially difficult problem in chemical names. From the parsed chemical name, an XML parse tree is constructed and then operated on stepwise until the structure has been reconstructed from the name. Results from OPSIN on various computer-generated name/structure pair sets are presented. These show exceptionally high precision (99.8%+) and, when using general organic chemical nomenclature, high recall (98.7-99.2%). This software can serve as the basis for future open-source developments in chemical name interpretation.
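To illustrate why a grammar helps with tokenization, the following is a minimal sketch of grammar-guided tokenization that emits a flat XML parse tree. It is not OPSIN's code or grammar: the vocabulary `VOCAB`, the automaton `TRANSITIONS`, and the functions `tokenize` and `to_xml` are all hypothetical and cover only a handful of toy names. The point it shows is that a prefix such as "meth" can be read either as a stem or as the start of the substituent "methyl", and that the grammar decides which reading leads to a complete parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy vocabulary (token text -> token class); not OPSIN's.
# Stems are listed first so the ambiguous prefix "meth" is tried before "methyl".
VOCAB = {
    "meth": "stem", "eth": "stem", "prop": "stem", "but": "stem",
    "di": "multiplier", "tri": "multiplier",
    "methyl": "substituent", "ethyl": "substituent", "chloro": "substituent",
    "ane": "suffix", "anol": "suffix",
}

# Toy regular grammar written as a finite automaton over token classes:
#   (multiplier? substituent)* stem suffix
TRANSITIONS = {
    "start": {"multiplier": "need_subst", "substituent": "start", "stem": "need_suffix"},
    "need_subst": {"substituent": "start"},
    "need_suffix": {"suffix": "done"},
    "done": {},
}


def tokenize(name, pos=0, state="start"):
    """Return a list of (text, class) tokens, or None if the grammar rejects the name."""
    if pos == len(name):
        return [] if state == "done" else None
    for text, cls in VOCAB.items():
        next_state = TRANSITIONS[state].get(cls)
        # Skip tokens the grammar forbids in this state or that do not match here.
        if next_state is None or not name.startswith(text, pos):
            continue
        rest = tokenize(name, pos + len(text), next_state)
        if rest is not None:
            return [(text, cls)] + rest
        # Otherwise backtrack and try the next candidate token.
    return None


def to_xml(name):
    """Build a flat XML parse tree with one child element per token."""
    tokens = tokenize(name.lower())
    if tokens is None:
        raise ValueError(f"cannot parse name: {name!r}")
    root = ET.Element("name", {"value": name})
    for text, cls in tokens:
        ET.SubElement(root, cls).text = text
    return root


if __name__ == "__main__":
    print(ET.tostring(to_xml("dimethylpropane"), encoding="unicode"))
```

In this sketch, "methylpropane" is first attempted as the stem "meth", which fails because no suffix matches "ylpropane"; the tokenizer then backtracks to the substituent "methyl". A real interpreter would go on to transform such a tree step by step into a structure, which is beyond this example.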