Department of Computer Science, University of Kentucky, United States.
Division of Hospital Medicine, Department of Internal Medicine, University of Kentucky, United States.
J Biomed Inform. 2018 Jun;82:189-199. doi: 10.1016/j.jbi.2018.05.003. Epub 2018 May 12.
Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since not all candidate drugs can be tested in animal and clinical trials, in vitro approaches are attempted first to identify promising candidates. Likewise, identifying the different causal relations between biomedical entities is critical to understanding biomedical processes. Generally, natural language processing (NLP) and machine learning are used to predict specific relations between a given pair of entities using the distant supervision approach.
Our objective is to build high-accuracy supervised models that predict previously unknown treatment and causative relations between biomedical entities, based only on semantic graph pattern features extracted from biomedical knowledge graphs.
We used 7000 hand-curated treats relations and 2918 hand-curated causes relations from the UMLS Metathesaurus to train and test our models. Our graph pattern features are extracted from simple paths connecting biomedical entities in the SemMedDB graph (based on the well-known SemMedDB database made available by the U.S. National Library of Medicine). Using these graph patterns as features for logistic regression and decision tree models, we computed mean performance measures (precision, recall, F-score) over 100 distinct 80-20% train-test splits of the datasets. For all experiments, we used a positive:negative class imbalance of 1:10 in the test set to model more realistic scenarios.
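To make the notion of a graph pattern feature concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It treats a pattern as the sequence of predicates along a simple path between two entities. The toy edge list, the entity and predicate names, the directed traversal, the `max_len` cutoff, and the helper names `build_adjacency` and `path_patterns` are all hypothetical; the actual patterns may encode more than predicate sequences.

```python
from collections import Counter

# Toy directed graph with typed edges: (subject, predicate, object).
# Every entity and edge here is made up for illustration only.
EDGES = [
    ("drug_A", "INHIBITS", "gene_X"),
    ("gene_X", "CAUSES", "disease_D"),
    ("drug_A", "INTERACTS_WITH", "drug_B"),
    ("drug_B", "TREATS", "disease_D"),
    ("drug_A", "AFFECTS", "disease_D"),
]


def build_adjacency(edges):
    """Map each subject to its outgoing (predicate, object) pairs."""
    adj = {}
    for subj, pred, obj in edges:
        adj.setdefault(subj, []).append((pred, obj))
    return adj


def path_patterns(adj, source, target, max_len=3):
    """Count predicate sequences along simple paths from source to target.

    A simple path visits no node twice. max_len bounds the number of
    edges per path (an assumed cutoff, chosen here for illustration).
    """
    patterns = Counter()

    def dfs(node, visited, preds):
        if node == target and preds:
            patterns[tuple(preds)] += 1
            return
        if len(preds) == max_len:
            return
        for pred, nxt in adj.get(node, []):
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, preds + [pred])

    dfs(source, {source}, [])
    return patterns


adj = build_adjacency(EDGES)
features = path_patterns(adj, "drug_A", "disease_D")
for pattern, count in sorted(features.items()):
    print(" -> ".join(pattern), count)
```

Each distinct predicate sequence becomes one feature for the candidate entity pair, and its count (or simple presence) can then be fed to a classifier such as logistic regression.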
Our models predict treats and causes relations with high F-scores of 99% and 90%, respectively. Logistic regression model coefficients also help us identify highly discriminative patterns that have an intuitive interpretation. In collaboration with two physician co-authors, we also identified some new plausible relations among the false positives that our models scored highly. Finally, our decision tree models are able to retrieve over 50% of treatment relations from a recently created external dataset.
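For reference, the F-score is the harmonic mean of precision and recall. The sketch below computes all three from confusion counts and shows why the test-set class ratio matters: the same classifier behaviour yields lower precision when negatives outnumber positives. The counts are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical classifier: finds 90 of 100 positives and wrongly
# flags 3% of negatives. These numbers are illustrative only.
tp, fn = 90, 10

# 1:10 positive:negative test set -> 1000 negatives, 30 false positives.
p_imb, r_imb, f_imb = precision_recall_f1(tp, fp=30, fn=fn)

# 1:1 test set -> 100 negatives, 3 false positives.
p_bal, r_bal, f_bal = precision_recall_f1(tp, fp=3, fn=fn)

print(f"1:10  precision={p_imb:.3f} recall={r_imb:.3f} F1={f_imb:.3f}")
print(f"1:1   precision={p_bal:.3f} recall={r_bal:.3f} F1={f_bal:.3f}")
```

Recall is unchanged between the two settings, while precision and therefore the F-score drop under the 1:10 ratio, which is why an imbalanced test set gives a more conservative estimate.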
We employed semantic graph patterns connecting pairs of candidate biomedical entities in a knowledge graph as features to predict treatment and causative relations between them. We provide what we believe is the first evidence for the direct prediction of biomedical relations based on graph features. Our work complements lexical pattern-based approaches in that the graph patterns can be used as additional features for weakly supervised relation prediction.