Data Science and Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
J Biomed Inform. 2019 May;93:103156. doi: 10.1016/j.jbi.2019.103156. Epub 2019 Mar 19.
To extract and generate a valid metabolic pathway from research articles, biologists need substantial amounts of time to digest unstructured text. Text mining currently plays a central role in this research area because it can automatically discover useful information in a reasonable time. A text mining model can be built from training data, or a corpus, in a supervised manner. Unfortunately, a corpus for the domain of interest is not always available, or may be insufficient in practice, because corpus construction is a labor-intensive task that requires specialist annotation. In this paper, we developed a system for event extraction, a text-mining task, that extracts metabolic interactions from research literature and then reconstructs metabolic pathways. The proposed system consists of a pipeline of four supervised-learning steps: named entity recognition, trigger detection, edge detection, and event reconstruction. We also introduced a multitask-learning algorithm, a transfer-learning paradigm, that leverages additional resources from an existing source domain to facilitate classification for metabolic event extraction in the target domain. As a proof of concept, edge detection, a core step in our event extraction system, was used as a case study in multitask-learning classification. The experimental results showed that the proposed event extraction system performed competitively with a state-of-the-art related system. In particular, the proposed multitask learning can improve the performance of edge detection; the overall performance of the event extraction system therefore improved accordingly.
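The abstract does not specify how the multitask-learning algorithm shares information between the source and target domains. As a rough illustration of the general idea only, the sketch below uses feature augmentation, a simple and well-known transfer technique swapped in here for illustration, not the authors' method. Each feature is copied into a shared version and a domain-specific version, and a single perceptron is trained on both domains. The feature names (`short_path`, `long_path`, `is_compound`), the domain labels, and the toy examples are all hypothetical.

```python
from collections import defaultdict


def augment(features, domain):
    """Copy each feature into a shared version and a domain-specific version."""
    out = {}
    for name, value in features.items():
        out[("shared", name)] = value   # weight learned from every domain
        out[(domain, name)] = value     # weight learned from this domain only
    return out


def train_perceptron(examples, epochs=10):
    """Train a binary perceptron on (features, domain, label) triples.

    Labels are +1 (an edge exists between trigger and argument) or -1.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, domain, label in examples:
            x = augment(features, domain)
            score = sum(weights[k] * v for k, v in x.items())
            if label * score <= 0:      # misclassified (or undecided): update
                for k, v in x.items():
                    weights[k] += label * v
    return weights


def predict(weights, features, domain):
    """Return +1 or -1 for a candidate edge in the given domain."""
    x = augment(features, domain)
    score = sum(weights.get(k, 0.0) * v for k, v in x.items())
    return 1 if score > 0 else -1


# Hypothetical toy data: candidate trigger-argument edges from two domains.
examples = [
    ({"short_path": 1.0}, "source", 1),
    ({"long_path": 1.0}, "source", -1),
    ({"short_path": 1.0, "is_compound": 1.0}, "target", 1),
    ({"long_path": 1.0, "is_compound": 1.0}, "target", -1),
]

weights = train_perceptron(examples)

# Target-domain predictions reuse the shared weights learned from source data.
print(predict(weights, {"short_path": 1.0}, "target"))
print(predict(weights, {"long_path": 1.0}, "target"))
```

In this toy setup the target-domain predictions rely on the `("shared", ...)` weights, which were learned from the source-domain examples. This mirrors, in a very simplified form, the idea of using an existing source corpus to help classification when target-domain annotation is scarce.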