School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Buk-gu, Gwangju, South Korea.
PLoS One. 2019 Aug 28;14(8):e0221582. doi: 10.1371/journal.pone.0221582. eCollection 2019.
Many new medicines have been derived from natural sources such as plants, which have a long history of use in treating disease. Their benefits and side effects have therefore been studied, and plant-related information, including plant-disease relations, has accumulated in Medline articles. Because Medline contains numerous articles written in natural language, text mining is important for extracting this information. However, no corpus of plant-disease relations is available yet. We therefore aimed to construct one.
In this study, we designed and annotated a plant-disease relations corpus and proposed a computational model that predicts plant-disease relations using the corpus. We categorized plant-disease relations into four types: treatments of diseases, causes of diseases, associations, and negative relations. To construct the corpus, we first created annotation guidelines and randomly selected 200 Medline abstracts. From these abstracts, we identified 1,405 plant mentions and 1,755 disease mentions, annotated to 105 unique plant identifiers and 237 unique disease identifiers, respectively. Restricting the data to sentences containing at least one plant mention and one disease mention left 878 plant and 1,077 disease entities, which yielded a corpus of 1,309 plant-disease relations from 199 abstracts. To verify the effectiveness of the corpus, we proposed a convolutional neural network model with the shortest dependency path (SDP-CNN) and applied it to the constructed corpus. The micro F-score under ten-fold cross-validation was 0.764. We also applied the proposed SDP-CNN model to all Medline abstracts. Measured on 483 randomly selected sentences in which a plant and a disease co-occur, the model achieved a precision of 0.707.
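For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F-score over the four relation types could be computed. The label strings and the function name `micro_f_score` are illustrative assumptions, not taken from the corpus or the authors' code, and the abstract does not state which relation types were included in the average, so the set of scored labels is left as a parameter.

```python
# Hypothetical label names for the four relation types named in the abstract.
RELATION_TYPES = ("treatment", "cause", "association", "negative")


def micro_f_score(gold, predicted, labels=RELATION_TYPES):
    """Micro-averaged F-score: pool TP/FP/FN over all scored labels.

    gold, predicted: equal-length sequences of relation labels,
    one per plant-disease candidate pair.
    labels: the relation types that count toward the score
    (an assumption; the abstract does not specify this).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p:
            if g in labels:
                tp += 1          # correct prediction of a scored label
        else:
            if p in labels:
                fp += 1          # predicted a scored label wrongly
            if g in labels:
                fn += 1          # missed a scored gold label

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["treatment", "cause", "association", "negative"]
    pred = ["treatment", "treatment", "association", "negative"]
    print(micro_f_score(gold, pred))
```

When every label is scored and each candidate pair carries exactly one label, the micro F-score equals plain accuracy; passing a narrower `labels` tuple (for example, omitting `"negative"`) gives the variant that scores only the remaining relation types.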
The plant-disease relations corpus is unique and represents an important resource for biomedical text-mining. The corpus of plant and disease relations is available at http://gcancer.org/pdr/.