Detection and categorization of bacteria habitats using shallow linguistic analysis.

Author information

Karadeniz İlknur, Özgür Arzucan

Publication information

BMC Bioinformatics. 2015;16(Suppl 10):S5. doi: 10.1186/1471-2105-16-S10-S5. Epub 2015 Jul 13.

Abstract

BACKGROUND

Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resources. Developing methods to automatically extract bacteria habitat relations from the text of these electronic resources is crucial for facilitating research in these areas.

METHODS

We introduce a linguistically motivated, rule-based approach that uses an ontology to recognize and normalize names of bacteria habitats in biomedical text. Our approach is based on shallow syntactic analysis of the text, which includes sentence segmentation, part-of-speech (POS) tagging, partial parsing, and lemmatization. In addition, we propose two methods for identifying bacteria habitat localization relations. The first method assumes that discourse changes with each new paragraph, so it operates on a paragraph basis. The second method performs a more fine-grained analysis of the text and operates on a sentence basis. We also develop a novel anaphora resolution method for bacteria coreferences and incorporate it into the sentence-based relation extraction approach.
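The difference between the paragraph-based and sentence-based methods can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes entity names are already known, pairs a bacterium with a habitat whenever both occur in the same text unit, and uses a naive regex sentence splitter. The function names (`split_sentences`, `find_mentions`, `extract_relations`) and the example text are hypothetical.

```python
import re
from itertools import product


def split_sentences(paragraph):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # Abbreviations such as "E. coli" would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def find_mentions(text, names):
    # Case-insensitive whole-word lookup of known entity names in one text unit.
    found = []
    for name in names:
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            found.append(name)
    return found


def extract_relations(document, bacteria, habitats, unit="sentence"):
    # Pair every bacterium with every habitat that shares a text unit.
    # unit="paragraph" treats each blank-line-separated paragraph as one unit;
    # unit="sentence" treats each sentence as one unit.
    relations = set()
    for paragraph in document.split("\n\n"):
        if unit == "paragraph":
            units = [paragraph]
        else:
            units = split_sentences(paragraph)
        for text_unit in units:
            pairs = product(find_mentions(text_unit, bacteria),
                            find_mentions(text_unit, habitats))
            relations.update(pairs)
    return relations


# Hypothetical example text and entity lists.
doc = ("Vibrio cholerae is a pathogen. It lives in brackish water.\n\n"
       "Escherichia coli colonizes the human gut.")
bacteria = ["Vibrio cholerae", "Escherichia coli"]
habitats = ["brackish water", "human gut"]

by_paragraph = extract_relations(doc, bacteria, habitats, unit="paragraph")
by_sentence = extract_relations(doc, bacteria, habitats, unit="sentence")
```

In this example the sentence-level pass misses the pair (Vibrio cholerae, brackish water) because the second sentence refers to the bacterium only as "It". That gap is the kind of case an anaphora resolution step is meant to close.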

RESULTS

We participated in the Bacteria Biotope (BB) Task of the BioNLP Shared Task 2013. Our system (Boun) achieved the second best performance with a Slot Error Rate (SER) of 68% in Sub-task 1 (Entity Detection and Categorization), and ranked third with an F-score of 27% in Sub-task 2 (Localization Event Extraction). This paper describes the system implemented for the shared task, including the novel methods developed and the improvements obtained after the official evaluation. The extensions include the expansion of the OntoBiotope ontology using the training set for Sub-task 1, and the novel sentence-based relation extraction method combined with anaphora resolution for Sub-task 2. These extensions yielded promising results for Sub-task 1 with a SER of 68%, and state-of-the-art performance for Sub-task 2 with an F-score of 53%.
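For readers unfamiliar with the two metrics, the sketch below computes them in their standard textbook forms. SER is an error measure, so lower is better, while a higher F-score is better. The counts are invented for illustration and are not taken from the paper, and the shared task's official scorer may differ in how it credits partial matches.

```python
def slot_error_rate(substitutions, deletions, insertions, n_reference):
    # Standard unweighted SER: total errors divided by the number of
    # reference (gold) slots. Values above 1.0 are possible.
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return (substitutions + deletions + insertions) / n_reference


def f_score(precision, recall):
    # Balanced F-score (F1): harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to show the arithmetic.
ser = slot_error_rate(substitutions=10, deletions=40, insertions=18,
                      n_reference=100)
f1 = f_score(precision=0.5, recall=0.5)
```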

CONCLUSIONS

Our results show that a linguistically oriented approach based on shallow syntactic analysis of the text is as effective as machine learning approaches for the detection and ontology-based normalization of habitat entities. Furthermore, the newly developed sentence-based relation extraction system with the anaphora resolution module significantly outperforms the paragraph-based one, as well as the other systems that participated in the BB Shared Task 2013.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee69/4511461/e5d721905282/1471-2105-16-S10-S5-1.jpg
