Department of Computer Engineering, Bilkent University, Ankara, Turkey.
J Biomed Inform. 2009 Dec;42(6):1046-55. doi: 10.1016/j.jbi.2009.05.004. Epub 2009 May 13.
Protein name extraction, one of the basic tasks in the automatic extraction of information from biological texts, remains challenging. In this paper, we explore two machine learning techniques for this task and present the results of our experiments. The first method uses a bigram language model to extract protein names. The second uses an automatic rule learning method to identify protein names in biological texts. In both cases, we generalize protein names by using hierarchically categorized syntactic token types. We conducted our experiments on two datasets. The bigram language model method achieved an F-score of 67.7% on the YAPEX dataset and 66.8% on the GENIA corpus. The rule learning method obtained an F-score of 61.8% on the YAPEX dataset and 61.0% on the GENIA corpus. The results of the comparative experiments demonstrate that both techniques are applicable to automatic protein name extraction, a prerequisite for the large-scale processing of biomedical literature.
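The idea of generalizing protein names through syntactic token types and scoring them with a bigram model can be illustrated with a minimal sketch. The abstract does not specify the actual type hierarchy, smoothing, or training procedure, so everything here is an assumption made for illustration: the flat set of categories in `token_type`, the add-one smoothing, and the helper names `train_bigrams` and `bigram_prob` are hypothetical and not taken from the paper.

```python
import re
from collections import Counter


def token_type(token: str) -> str:
    """Map a token to a coarse syntactic type (hypothetical categories)."""
    if token.isdigit():
        return "DIGITS"
    if re.fullmatch(r"[A-Z]+", token):
        return "ALL_CAPS"
    if re.fullmatch(r"[A-Za-z]+\d+[A-Za-z\d]*", token):
        return "ALNUM"
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "INIT_CAP"
    if re.fullmatch(r"[a-z]+", token):
        return "LOWER"
    return "OTHER"


def train_bigrams(names):
    """Count token-type unigrams and bigrams over tokenized protein names."""
    unigrams, bigrams = Counter(), Counter()
    for name in names:
        types = ["<s>"] + [token_type(t) for t in name] + ["</s>"]
        unigrams.update(types[:-1])          # contexts only, so </s> is excluded
        bigrams.update(zip(types, types[1:]))
    return unigrams, bigrams


def bigram_prob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate of P(cur | prev) (smoothing is an assumption)."""
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)


# Toy training data: each protein name is a list of tokens.
names = [["IL", "2"], ["p53"]]
unigrams, bigrams = train_bigrams(names)
# Six token types plus the two boundary symbols.
p = bigram_prob("<s>", "ALL_CAPS", unigrams, bigrams, vocab_size=8)
```

Working on token types rather than surface strings lets a model trained on names such as "IL 2" assign a comparable score to an unseen name with the same shape, which is the kind of generalization the abstract describes.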