Grego Tiago, Pesquita Catia, Bastos Hugo P, Couto Francisco M
Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal.
ISRN Bioinform. 2012 Feb 15;2012:619427. doi: 10.5402/2012/619427. eCollection 2012.
Chemical entities are ubiquitous throughout the biomedical literature, and text-mining systems that can efficiently identify those entities are required. Owing to the lack of available corpora and data resources, the community has focused its efforts on developing gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, chemical entity recognition can now be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution, and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2-5% for the entity resolution task, and 15% for the combined entity recognition and resolution task.
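As a rough illustration of what lexical-similarity-based entity resolution can look like, the sketch below maps a recognized mention to the closest name in a small lexicon. It is not the authors' method: the abstract does not specify the similarity measure, so the character-level ratio from Python's difflib is used as a stand-in, and the lexicon entries, the placeholder identifiers (EX:1 and so on), the function names, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: placeholder identifiers mapped to chemical names.
# A real system would load names and synonyms from a resource such as ChEBI.
LEXICON = {
    "EX:1": "water",
    "EX:2": "ethanol",
    "EX:3": "caffeine",
}


def similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(mention, lexicon=LEXICON, threshold=0.8):
    """Return (identifier, score) for the most similar lexicon name.

    The identifier is None when the best score is below the threshold
    (an assumed cut-off), meaning the mention is left unresolved.
    """
    best_id, best_score = None, 0.0
    for entry_id, name in lexicon.items():
        score = similarity(mention, name)
        if score > best_score:
            best_id, best_score = entry_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


if __name__ == "__main__":
    for mention in ("Ethanol", "caffein", "benzene"):
        print(mention, resolve(mention))
```

Unlike an exact dictionary lookup, this kind of matching tolerates spelling variants such as a truncated "caffein", while the threshold keeps unrelated mentions from being forced onto the nearest entry.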