Yeh Alexander S, Hirschman Lynette, Morgan Alexander A
The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA.
Bioinformatics. 2003;19 Suppl 1:i331-9. doi: 10.1093/bioinformatics/btg1046.
The biological literature is a major repository of knowledge. Many biological databases draw much of their content from careful curation of this literature. However, as the volume of literature increases, so does the burden of curation. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful.
We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 journal articles curated in FlyBase, along with the associated lists of genes and gene products and the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles. The 18 participating groups provided systems that flagged articles for curation according to whether each article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top-performing groups.