Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification.

Author information

Schuemie Martijn J, Mons Barend, Weeber Marc, Kors Jan A

Affiliation

Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, The Netherlands.

Publication information

J Biomed Inform. 2007 Jun;40(3):316-24. doi: 10.1016/j.jbi.2006.09.002. Epub 2006 Sep 24.

Abstract

Gene and protein name identification in text requires a dictionary approach to relate synonyms to the same gene or protein, and to link names to external databases. However, existing dictionaries are incomplete. We investigate two complementary methods for automatic generation of a comprehensive dictionary: combination of information from existing gene and protein databases, and rule-based generation of spelling variations. Both methods have been reported in the literature before, but have hitherto not been combined and evaluated systematically. We combined gene and protein names from several existing databases for four different organisms. The combined dictionaries showed a substantial increase in recall on three different test sets, as compared to any single database. Application of 23 spelling variation rules to the combined dictionaries further increased recall. However, many rules appeared to have no effect, and some appeared to have a detrimental effect on precision.
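
As a rough illustration of the two methods described in the abstract, the sketch below merges synonym sets from several sources and then expands each name with rule-based spelling variants. Everything specific here is an assumption made for illustration: the abstract does not list the 23 rules, so the two rules shown (separator substitution and separating a trailing number) are hypothetical stand-ins, and the function names, gene identifiers, and toy data are invented. Recall is computed here as simple exact-match lookup over a list of gold names, which is much cruder than matching in running text.

```python
import re
from collections import defaultdict


def merge_dictionaries(*sources):
    """Union the synonym sets of several sources, keyed by gene identifier."""
    merged = defaultdict(set)
    for source in sources:
        for gene_id, names in source.items():
            merged[gene_id].update(names)
    return dict(merged)


def spelling_variants(name):
    """Return the name plus variants from two illustrative rules.

    These are hypothetical examples, not the rules evaluated in the paper.
    """
    variants = {name}
    # Hypothetical rule 1: treat hyphen, space, and no separator as equivalent.
    if "-" in name or " " in name:
        for sep in ("-", " ", ""):
            variants.add(re.sub(r"[- ]", sep, name))
    # Hypothetical rule 2: separate a trailing number from a letter prefix.
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if match:
        prefix, number = match.groups()
        variants.add(f"{prefix}-{number}")
        variants.add(f"{prefix} {number}")
    return variants


def expand(dictionary):
    """Apply spelling_variants to every name of every gene."""
    return {
        gene_id: set().union(*(spelling_variants(n) for n in names))
        for gene_id, names in dictionary.items()
    }


def recall(dictionary, gold_names):
    """Fraction of gold names found as an exact entry in the dictionary."""
    known = set().union(*dictionary.values()) if dictionary else set()
    return sum(name in known for name in gold_names) / len(gold_names)


# Toy data, invented for this example.
db_a = {"G1": {"IL2"}}
db_b = {"G1": {"interleukin 2"}, "G2": {"p53"}}
gold = ["IL-2", "interleukin-2", "p53", "TP53"]

merged = merge_dictionaries(db_a, db_b)
expanded = expand(merged)

recall_single = recall(db_a, gold)
recall_merged = recall(merged, gold)
recall_expanded = recall(expanded, gold)
```

On this toy data, recall rises from the single source to the merged dictionary and again after expansion, while "TP53" stays unmatched because no source lists it and no rule generates it. Precision is not measured in the sketch; a variant such as "p 53" shows how generated names could in principle produce false matches, which is the kind of trade-off the abstract reports for some rules.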
