Lee Dong-Gi, Shin Hyunjung
Department of Industrial Engineering, Ajou University, 206 Worldcup-ro, Yeongtong-gu, Suwon, 16499, South Korea.
BMC Med Inform Decis Mak. 2017 May 18;17(Suppl 1):53. doi: 10.1186/s12911-017-0448-y.
Research on human disease networks has advanced in recent years and now helps clarify the relationships among diseases. In most disease networks, however, the relationship between two diseases is represented simply as an association, which makes it difficult to identify prior diseases and their influence on posterior diseases. In this paper, we propose a causal disease network that captures disease causality through text mining of the biomedical literature.
To identify causality between diseases, the proposed method combines two schemes. The first, lexicon-based causality term strength, assigns a causal strength to each of a variety of causality terms based on lexicon analysis. The second, frequency-based causality strength, determines the direction and strength of causality from document and clause frequencies in the literature.
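As a rough illustration of how the two schemes might fit together, the following Python sketch scores a directed disease pair from a toy corpus. Everything in it is an assumption made for illustration and is not taken from the paper: the lexicon entries and their strength values, the clause tuple layout, the combination rule (summed term strength scaled by the fraction of documents supporting a direction), and the function names `directed_scores` and `causal_direction`.

```python
from collections import defaultdict

# Hypothetical lexicon: causality term -> strength. Values are illustrative only.
TERM_STRENGTH = {"causes": 1.0, "leads to": 0.8, "is associated with": 0.3}

# Toy corpus: one tuple per clause, (document id, source disease, term, target disease).
CLAUSES = [
    ("d1", "obesity", "causes", "diabetes"),
    ("d1", "obesity", "leads to", "diabetes"),
    ("d2", "obesity", "leads to", "diabetes"),
    ("d3", "diabetes", "is associated with", "obesity"),
]


def directed_scores(clauses, term_strength):
    """Score each ordered pair (source, target).

    Assumed rule: sum the term strengths over all clauses stating the pair
    (clause frequency weighted by the lexicon), then scale by the fraction
    of documents in which the pair occurs (document frequency).
    """
    strength_sum = defaultdict(float)
    docs_per_pair = defaultdict(set)
    all_docs = set()
    for doc_id, src, term, dst in clauses:
        all_docs.add(doc_id)
        strength_sum[(src, dst)] += term_strength.get(term, 0.0)
        docs_per_pair[(src, dst)].add(doc_id)
    return {
        pair: strength_sum[pair] * len(docs_per_pair[pair]) / len(all_docs)
        for pair in strength_sum
    }


def causal_direction(scores, a, b):
    """Return (source, target, net strength), or None if neither direction dominates."""
    forward = scores.get((a, b), 0.0)
    backward = scores.get((b, a), 0.0)
    if forward == backward:
        return None
    if forward > backward:
        return (a, b, forward - backward)
    return (b, a, backward - forward)


scores = directed_scores(CLAUSES, TERM_STRENGTH)
result = causal_direction(scores, "diabetes", "obesity")
print(result)
```

In this toy corpus the direction obesity to diabetes dominates because it is stated with stronger terms, in more clauses, and in more documents than the reverse direction. A real system would additionally need clause segmentation and disease-name recognition, which this sketch leaves out.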
We applied the proposed method to 6,617,833 PubMed articles and chose 195 diseases to construct a causal disease network. From all possible pairs of disease nodes in the network, 1011 causal pairs among 149 diseases were extracted. The resulting network was compared with that of a previous study. The proposed method performed better in both coverage and quality: it identified 2.7 times more causalities and showed higher correlation with associated diseases than the existing method.
The novelty of this research is twofold: the proposed method circumvents the time and cost limitations of testing all possible causalities in biological experiments, and it advances text mining by defining the concept of causality term strength.