Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining.

Affiliation

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.

Publication information

J Cheminform. 2010 Mar 23;2(1):3. doi: 10.1186/1758-2946-2-3.

Abstract

BACKGROUND

Previously, we developed Chemlist, a combined dictionary built from a number of publicly available databases for identifying small molecules and drugs in text, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

RESULTS

We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary, at ca. 80 k names, was only one-third to one-quarter the size of Chemlist, at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 after. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 after.

CONCLUSIONS

We conclude the following: (1) The ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve high precision for both the automatically generated and the manually curated dictionaries. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
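
The abstract reports precision and recall but not the F-score values behind conclusion (1). As a rough check, the sketch below recomputes them from the reported figures, assuming the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not say which F-score variant was used, so this is an assumption. The function name `f_score` and the `reported` table are illustrative only, and the results may differ slightly from the paper's own values because the inputs are rounded to two decimals.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F1 measure (assumed here).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported in the abstract,
# before and after filtering and disambiguation.
reported = {
    ("ChemSpider", "before"): (0.43, 0.19),
    ("ChemSpider", "after"): (0.87, 0.19),
    ("Chemlist", "before"): (0.20, 0.47),
    ("Chemlist", "after"): (0.67, 0.40),
}

f1 = {key: f_score(p, r) for key, (p, r) in reported.items()}

for (name, stage), value in f1.items():
    print(f"{name:10s} {stage:6s} F1 = {value:.2f}")
```

Under this assumption, Chemlist after filtering and disambiguation comes out at roughly 0.50 and ChemSpider at roughly 0.31, which is consistent with the statement that Chemlist had the best F-score despite its lower precision.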
