Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
J Cheminform. 2010 Mar 23;2(1):3. doi: 10.1186/1758-2946-2-3.
Previously, we developed Chemlist, a combined dictionary built from a number of publicly available databases for identifying small molecules and drugs in text, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.
We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. At ca. 80 k names, the ChemSpider dictionary was only one-quarter to one-third the size of Chemlist (around 300 k names). The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 afterwards. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 afterwards.
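For readers unfamiliar with the two measures, the following minimal sketch shows how precision and recall can be computed when dictionary matches are compared against a gold-standard annotation. It assumes exact matching of (document, start, end) spans; the abstract does not state the matching criterion used in the study, and the function name and toy data are purely illustrative.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of mention spans.

    A predicted span counts as a true positive only if an identical span
    is present in the gold standard (exact-match assumption).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): spans are (document id, start offset, end offset).
gold = {("d1", 0, 7), ("d1", 20, 29), ("d2", 5, 12), ("d2", 30, 41)}
predicted = {("d1", 0, 7), ("d2", 5, 12), ("d2", 50, 53)}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three predicted spans are correct and two of the four gold spans are found, which illustrates how a dictionary can be fairly precise while still missing many mentions.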
We conclude the following: (1) the ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) rule-based filtering and disambiguation are necessary to achieve high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
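The F-score comparison can be checked from the reported precision and recall values. The sketch below assumes the balanced F-score (F1, the harmonic mean of precision and recall); the abstract does not say which variant was used, so the resulting numbers are illustrative rather than the study's reported figures.

```python
def f1_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract.
reported = {
    "ChemSpider, before filtering": (0.43, 0.19),
    "ChemSpider, after filtering": (0.87, 0.19),
    "Chemlist, before filtering": (0.20, 0.47),
    "Chemlist, after filtering": (0.67, 0.40),
}

scores = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

Under this assumption, Chemlist after filtering scores about 0.50 against about 0.31 for ChemSpider, consistent with the conclusion that Chemlist had the best F-score despite its lower precision.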