Finkel Jenny, Dingare Shipra, Manning Christopher D, Nissim Malvina, Alex Beatrice, Grover Claire
Department of Computer Science, Stanford University, Stanford, CA 94305-9040, USA.
BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S5. doi: 10.1186/1471-2105-6-S1-S5. Epub 2005 May 24.
Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools.
We present a maximum-entropy-based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts.
This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the "open" evaluation and a precision of 0.78 and recall of 0.85 in the "closed" evaluation.
Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches.
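The precision and recall figures reported above can be combined into a single balanced F-measure (the harmonic mean of the two). The sketch below is only an illustration computed from the rounded values quoted in this abstract. The function name `f1_score` and the dictionary layout are assumptions made for the example, not part of the described system, and the results may differ slightly from officially reported scores because the inputs are rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as quoted in the abstract (rounded to two decimals).
reported = {"open": (0.83, 0.84), "closed": (0.78, 0.85)}

# Hypothetical helper structure: one F1 value per evaluation track.
f1_by_track = {track: f1_score(p, r) for track, (p, r) in reported.items()}

for track, f1 in f1_by_track.items():
    print(f"{track}: F1 = {f1:.3f}")
```

With these rounded inputs, the balanced F-measure comes out at roughly 0.835 for the "open" evaluation and roughly 0.813 for the "closed" evaluation.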