Adverse Drug Event Prediction Using Noisy Literature-Derived Knowledge Graphs: Algorithm Development and Validation.

Author information

Dasgupta Soham, Jayagopal Aishwarya, Jun Hong Abel Lim, Mariappan Ragunathan, Rajan Vaibhav

Affiliations

Mallya Aditi International School, Bangalore, India.

School of Computing, National University of Singapore, Singapore, Singapore.

Publication information

JMIR Med Inform. 2021 Oct 25;9(10):e32730. doi: 10.2196/32730.

DOI:10.2196/32730

PMID:34694230

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8576589/

Abstract

BACKGROUND

Adverse drug events (ADEs) are unintended side effects of drugs that cause substantial clinical and economic burdens globally. Not all ADEs are discovered during clinical trials; therefore, postmarketing surveillance, called pharmacovigilance, is routinely conducted to find unknown ADEs. A wealth of information that facilitates ADE discovery lies in the growing body of biomedical literature. Knowledge graphs (KGs) encode information from the literature, where the vertices and the edges represent clinical concepts and their relations, respectively. The scale and unstructured form of the literature necessitate the use of natural language processing (NLP) to automatically create such KGs. Previous studies have demonstrated the utility of such literature-derived KGs in ADE prediction. Through unsupervised learning of the representations (features) of clinical concepts from the KG, which are used in machine learning models, state-of-the-art results for ADE prediction were obtained on benchmark data sets.

OBJECTIVE

Due to the use of NLP to infer literature-derived KGs, there is noise in the form of false positive (erroneous) and false negative (absent) nodes and edges. Previous representation learning methods do not account for such inaccuracies in the graph. NLP algorithms can quantify the confidence in their inference of extracted concepts and relations from the literature. Our hypothesis, which motivates this work, is that by using such confidence scores during representation learning, the learned embeddings would yield better features for ADE prediction models.

METHODS

We developed methods to use these confidence scores on two well-known representation learning methods, DeepWalk and Translating Embeddings for Modeling Multi-relational Data (TransE), to develop their weighted versions: Weighted DeepWalk and Weighted TransE. These methods were used to learn representations from a large literature-derived KG, the Semantic MEDLINE Database, which contains more than 93 million clinical relations. They were compared with Embedding of Semantic Predications, which, to our knowledge, is the best reported representation learning method using the Semantic MEDLINE Database with state-of-the-art results for ADE prediction. Representations learned from different methods were used (separately) as features of drugs and diseases to build classification models for ADE prediction using benchmark data sets. The methods were compared rigorously over multiple cross-validation settings.
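The abstract does not say how the confidence scores enter either algorithm. As a rough illustration only, one plausible reading of a weighted DeepWalk is that each random-walk step picks a neighbor with probability proportional to the confidence of the connecting edge. The sketch below makes that assumption; the function names, the triple format, the undirected treatment of edges, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Map each node to (neighbors, weights) from (head, tail, confidence) triples.

    Assumption: edges are treated as undirected for the purpose of walking,
    and the confidence score is used directly as the edge weight.
    """
    adj = defaultdict(lambda: ([], []))
    for head, tail, conf in triples:
        adj[head][0].append(tail)
        adj[head][1].append(conf)
        adj[tail][0].append(head)
        adj[tail][1].append(conf)
    return adj


def weighted_walk(adj, start, length, rng):
    """Sample one walk of at most `length` nodes, starting at `start`.

    Each step chooses a neighbor with probability proportional to the
    confidence on the connecting edge (hypothetical weighting scheme).
    """
    walk = [start]
    while len(walk) < length:
        neighbors, weights = adj.get(walk[-1], ([], []))
        if not neighbors:
            break  # isolated or unknown node: stop the walk early
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy data (invented for illustration): a zero-confidence edge is never followed.
toy_triples = [
    ("aspirin", "bleeding", 0.9),
    ("aspirin", "headache", 0.0),
]
toy_adj = build_adjacency(toy_triples)
toy_walk = weighted_walk(toy_adj, "aspirin", 5, random.Random(0))
```

In standard DeepWalk, walks like these are then treated as sentences and passed to a skip-gram model to learn node embeddings; under the assumption above, low-confidence edges would simply appear less often in the training corpus.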

RESULTS

The weighted versions we designed were able to learn representations that yielded more accurate predictive models than the corresponding unweighted versions of both DeepWalk and TransE, as well as Embedding of Semantic Predications, in our experiments. There were performance improvements of up to 5.75% in the F-score and 8.4% in the area under the receiver operating characteristic curve value, thus advancing the state of the art in ADE prediction from literature-derived KGs.

CONCLUSIONS

Our classification models can be used to aid pharmacovigilance teams in detecting potentially new ADEs. Our experiments demonstrate the importance of modeling inaccuracies in the inferred KGs for representation learning.
