Liu Xiao, Bordes Antoine, Grandvalet Yves
BMC Bioinformatics. 2015;16 Suppl 10(Suppl 10):S8. doi: 10.1186/1471-2105-16-S10-S8. Epub 2015 Jul 13.
Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers, are generated daily. These documents are mainly available as unstructured free text, which requires heavy processing before it can be registered in organized databases. This organization is instrumental for information retrieval, making it possible to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic text-mining methods that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these events because of the peculiarities of the domain: biomedical documents contain highly domain-specific jargon and syntax, and they describe distinctive dependencies, which makes text mining in molecular biology a discipline in its own right.
We address biomedical event extraction as the classification of pairs of text entities into classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on Support Vector Machines. This recursive process extracts events that take other events as arguments. Compared to joint models based on Markov Random Fields, our model simplifies inference and hence requires shorter training and prediction times as well as less memory. Compared to the usual pipeline approaches, our model bypasses a complex intermediate problem while making more extensive use of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of the BioNLP challenges, yielding the best result reported so far on the 2013 edition.
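The recursive pair classification described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Protein`, `Event`, `extract_events`, and `toy_classify` are hypothetical, the candidate generation and stopping rule are assumptions, and a hand-written rule stands in for the multiclass Support Vector Machine and its joint features. The sketch only shows how events found in one round can become arguments of candidate pairs in the next round.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass(frozen=True)
class Protein:
    text: str


@dataclass(frozen=True)
class Event:
    event_type: str
    trigger: str
    argument: Union[Protein, "Event"]  # an event may take another event as argument


Argument = Union[Protein, Event]
# Stand-in for the multiclass classifier: returns an event type or None.
Classifier = Callable[[str, Argument], Optional[str]]


def extract_events(
    candidate_words: List[str],
    proteins: List[Protein],
    classify: Classifier,
    max_rounds: int = 5,  # assumed safety bound on nesting depth
) -> List[Event]:
    """Classify (word, argument) pairs round by round.

    Round 1 pairs candidate words with proteins; each later round pairs them
    with the events found in the previous round, so nested events emerge.
    """
    events: List[Event] = []
    seen = set()
    frontier: List[Argument] = list(proteins)
    for _ in range(max_rounds):
        new_events: List[Event] = []
        for word in candidate_words:
            for arg in frontier:
                # Assumption: a word does not take its own event as argument.
                if isinstance(arg, Event) and arg.trigger == word:
                    continue
                label = classify(word, arg)
                if label is None:
                    continue
                event = Event(label, word, arg)
                if event not in seen:
                    seen.add(event)
                    new_events.append(event)
        if not new_events:
            break  # no pair was assigned an event type: stop recursing
        events.extend(new_events)
        frontier = new_events
    return events


def toy_classify(word: str, arg: Argument) -> Optional[str]:
    """Hypothetical rule-based stand-in for a trained multiclass SVM."""
    if word == "expression" and isinstance(arg, Protein):
        return "Gene_expression"
    if word == "inhibits" and isinstance(arg, Event):
        return "Negative_regulation"
    return None


if __name__ == "__main__":
    for ev in extract_events(["expression", "inhibits"], [Protein("IL-2")], toy_classify):
        print(ev)
```

With the toy rule, the first round yields a `Gene_expression` event on the protein, and the second round yields a `Negative_regulation` event whose argument is that first event. In a real system, `classify` would be replaced by a trained classifier scoring features of each pair.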