Yeh Alexander, Morgan Alexander, Colosimo Marc, Hirschman Lynette
The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA.
BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S2. doi: 10.1186/1471-2105-6-S1-S2. Epub 2005 May 24.
The biological research literature is a major repository of knowledge. As the volume of literature grows, it becomes harder to find the information of interest on a particular topic. A growing body of work applies text mining to this literature, but that work is hard to compare because there are no standards for making comparisons. To address this, we worked with colleagues at the Protein Design Group, CNB-CSIC, Madrid to develop BioCreAtIvE (Critical Assessment for Information Extraction in Biology), an open common evaluation of systems on a number of biological text mining tasks. We report here on task 1A, which deals with finding mentions of genes and related entities in text. "Finding mentions" is a basic task that can serve as a building block for other text mining tasks. The task uses data and evaluation software provided by the (US) National Center for Biotechnology Information (NCBI).
Fifteen teams took part in task 1A. A number of teams achieved scores over 80% F-measure (balanced precision and recall). Teams that tried to use their task 1A systems to help with other BioCreAtIvE tasks reported mixed results.
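For readers unfamiliar with the metric, the balanced F-measure is the harmonic mean of precision and recall. The sketch below is a minimal illustration in Python, not the NCBI evaluation software used for the task; the function name and the counts are hypothetical, and how a predicted mention is judged correct against the gold standard is left outside its scope.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Hypothetical helper for illustration; it takes counts as given and
    does not decide how a predicted mention is matched to a gold mention.
    """
    predicted = true_positives + false_positives  # mentions the system returned
    actual = true_positives + false_negatives     # mentions in the gold standard
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0  # no correct mentions: define the score as zero
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Made-up counts, chosen only for illustration (not results from the evaluation).
score = f_measure(true_positives=80, false_positives=20, false_negatives=20)
print(f"F-measure: {score:.2f}")
```

With these made-up counts, precision and recall are both 0.8, so the F-measure is also about 0.8. Because it is a harmonic mean, the score falls quickly when either precision or recall is low.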
The F-measure results of over 80% are good, but they still lag somewhat behind the best scores achieved in some other domains, such as newswire. This is due in part to the complexity and length of gene names compared with the person or organization names found in newswire.