Neves Mariana
Hasso-Plattner-Institut, Potsdam Universität, Potsdam, Germany.
F1000Res. 2014 Apr 25;3:96. doi: 10.12688/f1000research.3216.1. eCollection 2014.
Collections of documents annotated with semantic entities and relationships are crucial resources for developing and evaluating text mining solutions in the biomedical domain. Here I present an overview of 36 corpora and an analysis of the semantic annotations they contain. Annotations for entity types were classified into six semantic groups, and an overview of the semantic entities found in each corpus is given. The results show that while some semantic entities, such as genes, proteins and chemicals, are consistently annotated in many collections, corpora for diseases, variations and mutations are still few, in spite of their importance in the biological domain.