Language Intelligence and Information Retrieval Lab, KU Leuven, Belgium; Department of Computer Science, Celestijnenlaan 200 A, Leuven, Belgium.
J Biomed Inform. 2021 Jul;119:103820. doi: 10.1016/j.jbi.2021.103820. Epub 2021 May 24.
Identifying causal relationships between events or entities in biomedical texts is of great importance for creating scientific knowledge bases and is also a fundamental natural language processing (NLP) task. A causal (cause-effect) relation is defined as an association between two events in which the first must occur before the second. Although this task is an open problem in artificial intelligence and plays an important role in information extraction from the biomedical literature, very few works have considered it. With the advent of new machine learning techniques, especially deep neural networks, research increasingly addresses the problem. This paper summarizes state-of-the-art research, its applications, existing datasets, and remaining challenges. For this survey we implemented and evaluated various techniques, including a Multiview CNN (MVC), attention-based BiLSTM models, and state-of-the-art word embedding models, such as those obtained with bidirectional encoder representations (ELMo) and transformer architectures (BioBERT). In addition, we evaluated a graph LSTM as well as a baseline rule-based system. We also investigated class imbalance, an innate property of annotated data in this type of task. The results show that state-of-the-art systems improve considerably when simple random oversampling is used as data augmentation to reduce class imbalance.
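To make the random oversampling idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a list of training examples with one class label each and duplicates randomly chosen minority-class examples until every class matches the size of the largest class. The function name `random_oversample`, the label strings, and the toy sentences are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Balance classes by duplicating randomly chosen minority examples.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_examples, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for example in items + extra:
            out_examples.append(example)
            out_labels.append(label)

    # Shuffle so that duplicated examples are not clustered together.
    order = list(range(len(out_examples)))
    rng.shuffle(order)
    return [out_examples[i] for i in order], [out_labels[i] for i in order]


# Toy data (invented): one causal sentence pair against three non-causal ones.
sentences = [
    "Smoking causes lung cancer.",
    "The patient was admitted on Monday.",
    "Samples were stored at 4 degrees.",
    "The study enrolled 40 adults.",
]
classes = ["causal", "none", "none", "none"]

balanced_sentences, balanced_classes = random_oversample(sentences, classes)
class_counts = Counter(balanced_classes)  # each class now has 3 examples
```

Oversampling of this kind is normally applied to the training split only, so that duplicated examples do not leak into the evaluation data.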