Recognizing names in biomedical texts: a machine learning approach.

Authors

Zhou GuoDong, Zhang Jie, Su Jian, Shen Dan, Tan ChewLim

Affiliation

Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613.

Publication information

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Abstract

MOTIVATION

With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists gather and make use of the knowledge encoded in text documents. To make organized and structured information available, automatically recognizing biomedical entity names is critical for information retrieval, information extraction and automated knowledge acquisition.

RESULTS

In this paper, we present a named entity recognition system for the biomedical domain, called PowerBioNE. To deal with the special naming conventions of the biomedical domain, we propose various evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger; and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and an HMM-based named entity recognizer. In addition, a k-Nearest Neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing step that automatically extracts rules from the training data to deal with the cascaded entity name phenomenon. To the best of our knowledge, PowerBioNE is the first system that deals with the cascaded entity name phenomenon. Evaluation shows that our system achieves F-measures of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, our system achieves an F-measure of 75.8 on the "protein" class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without the help of any dictionaries. Evaluation also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9. Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus. This suggests that our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0, and in particular 86.2 on the "protein" class, without the help of any dictionaries. We think that an F-measure of 90 on the 23 classes of GENIA V3.0, and in particular 92 on the "protein" class, can be achieved by refining the annotation scheme of the GENIA corpus (for example, through a more flexible annotation scheme and better annotation consistency) and by including a reasonable biomedical dictionary.
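The first two feature types listed above (word formation pattern and morphological pattern) can be illustrated with a minimal sketch. The abstract does not give the actual feature inventory used in PowerBioNE, so the class names (`AlphaDigit`, `Hyphenated`, and so on), the fixed affix length of three characters, and the function names below are all assumptions made for illustration only.

```python
from typing import Dict, List


def word_formation(token: str) -> str:
    """Map a token to a coarse word-formation class (hypothetical classes)."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if has_digit and not has_alpha:
        return "Digits"        # e.g. "42"
    if has_digit and has_alpha:
        return "AlphaDigit"    # e.g. "IL-2"
    if "-" in token and has_alpha:
        return "Hyphenated"    # e.g. "NF-kappaB"
    if token.isupper():
        return "AllCaps"       # e.g. "DNA"
    if token[:1].isupper():
        return "InitCap"       # e.g. "Protein"
    if token.islower():
        return "Lower"         # e.g. "kinase"
    if has_alpha:
        return "MixedCase"     # e.g. "mRNA"
    return "Other"             # punctuation and anything else


def morphology(token: str, n: int = 3) -> Dict[str, str]:
    """Return the lowercased prefix and suffix of length n (assumed n = 3)."""
    lower = token.lower()
    return {"prefix": lower[:n], "suffix": lower[-n:]}


def extract_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Build one feature dictionary per token."""
    features = []
    for token in tokens:
        feats = {"token": token, "formation": word_formation(token)}
        feats.update(morphology(token))
        features.append(feats)
    return features


if __name__ == "__main__":
    for feats in extract_features(["IL-2", "activates", "NF-kappaB", "kinase"]):
        print(feats)
```

In a tagger of the kind the abstract describes, feature dictionaries like these would be one input among several (alongside part-of-speech and trigger features) to the sequence model; how PowerBioNE combines them inside its HMM is not specified here.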

AVAILABILITY

A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm. A technology license is available upon bilateral agreement.
