Klein Artjom, Riazanov Alexandre, Hindle Matthew M, Baker Christopher Jo
Computer Science and Applied Statistics Department, University of New Brunswick, Saint John, Canada.
J Biomed Semantics. 2014 Feb 25;5(1):11. doi: 10.1186/2041-1480-5-11.
Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems.
We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards: RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data, and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction focus mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents that can support mutation grounding and mutation impact extraction experiments.
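As a rough illustration of the kind of performance metric such queries compute, the following Python sketch compares a set of gold-standard mutation grounding annotations with a system's output and derives precision, recall, and F1. It is not the infrastructure's own implementation, which expresses these computations in SPARQL over RDF. The `Annotation` fields, the document and protein identifiers, and the exact-match criterion are all assumptions made for this example.

```python
from typing import NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """One mutation grounding annotation, flattened from its RDF form.

    The field names are hypothetical; they are not taken from the ontology.
    """
    document: str  # document identifier
    mutation: str  # normalized mutation mention, e.g. "A123T"
    protein: str   # identifier of the protein the mutation is grounded to


def precision_recall_f1(
    gold: Set[Annotation], system: Set[Annotation]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 using exact-match comparison."""
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example data: four curated annotations and three system predictions.
gold = {
    Annotation("doc1", "A123T", "protein:X"),
    Annotation("doc1", "G45D", "protein:X"),
    Annotation("doc2", "L7P", "protein:Y"),
    Annotation("doc2", "R88K", "protein:Y"),
}
system = {
    Annotation("doc1", "A123T", "protein:X"),
    Annotation("doc1", "G45D", "protein:Z"),  # mutation found, wrong protein
    Annotation("doc2", "L7P", "protein:Y"),
}

precision, recall, f1 = precision_recall_f1(gold, system)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy data, the wrongly grounded mutation counts against precision, and both it and the missed mutation count against recall. A SPARQL-based version would obtain the same counts by matching annotation resources in the gold and system graphs.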
We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.