Wang Liqin, Haug Peter J, Del Fiol Guilherme
Department of Biomedical Informatics, University of Utah, 421 Wakara Way, Salt Lake City, UT 84108, USA; Homer Warner Research Center, Intermountain Healthcare, 5121 South Cottonwood Street, Murray, UT 84107, USA.
J Biomed Inform. 2017 May;69:259-266. doi: 10.1016/j.jbi.2017.04.014. Epub 2017 Apr 20.
Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been explored. However, a challenge remains when the extracted associations contain substantial noise. Determining the relevance of an association by setting arbitrary cut-off points on multiple relevance scores is unreliable, and asking human experts to manually review a large number of associations is expensive. We propose that machine-learning-based classification can separate the signal from the noise and provide a feasible approach to creating and maintaining disease-specific vocabularies.
For simplicity, we initially focused on disease-medication associations. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features), such as frequency of occurrence. For the purpose of machine learning, we formed nine datasets for three diseases: each disease had two single-source datasets and one dataset combining the two. All datasets were labeled using existing reference standards. We then conducted two experiments: (1) to test whether adding features from the clinical data repository would improve the classification performance achieved using features from the biomedical literature only, and (2) to determine whether classifiers trained on datasets for known diseases would generalize to new diseases.
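The construction of a combined dataset from two single-source datasets can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `lit_`/`clin_` prefixes, the choice to take the union of concepts, the zero fill for features missing from one source, and all toy values are assumptions made for the example.

```python
from typing import Dict, Set

# Hypothetical feature table: concept name -> {feature name: value}.
FeatureTable = Dict[str, Dict[str, float]]


def combine_sources(literature: FeatureTable,
                    clinical: FeatureTable,
                    missing: float = 0.0) -> FeatureTable:
    """Merge two single-source tables into one combined table.

    Assumption: concepts are merged by union, and a feature that a
    source does not provide for a concept is filled with `missing`.
    """
    lit_names = sorted({f for feats in literature.values() for f in feats})
    clin_names = sorted({f for feats in clinical.values() for f in feats})
    combined: FeatureTable = {}
    for concept in sorted(set(literature) | set(clinical)):
        row: Dict[str, float] = {}
        for name in lit_names:
            row["lit_" + name] = literature.get(concept, {}).get(name, missing)
        for name in clin_names:
            row["clin_" + name] = clinical.get(concept, {}).get(name, missing)
        combined[concept] = row
    return combined


def label_rows(table: FeatureTable, reference_standard: Set[str]) -> Dict[str, int]:
    """Label each concept 1 if it appears in the reference standard, else 0."""
    return {concept: int(concept in reference_standard) for concept in table}


# Toy values for illustration only; they are not data from the study.
literature = {"metformin": {"frequency": 120.0}, "aspirin": {"frequency": 15.0}}
clinical = {"metformin": {"frequency": 300.0}, "insulin": {"frequency": 210.0}}

combined = combine_sources(literature, clinical)
labels = label_rows(combined, reference_standard={"metformin", "insulin"})
```

Prefixing feature names by source keeps the literature and clinical measures distinguishable, so a classifier can weight them independently.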
Simple logistic regression and LogitBoost were identified as the preferred models for the biomedical-literature datasets and the combined datasets, respectively. Classification using combined features performed significantly better than classification using biomedical-literature features alone (p-value<0.001). The performance of a classifier built from known diseases and used to predict associated concepts for a new disease did not differ significantly from that of a classifier built and tested on the new disease's own dataset.
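One way to set up the generalization experiment is a leave-one-disease-out split, in which each disease in turn plays the role of the "new" disease and the remaining diseases supply the training data. The sketch below assumes this protocol; the abstract does not state the exact procedure, and the function name, the placeholder disease labels, and the toy rows are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# A row is (feature vector, label); label 1 means the concept is relevant.
Row = Tuple[List[float], int]


def leave_one_disease_out(
    datasets: Dict[str, List[Row]]
) -> Iterator[Tuple[str, List[Row], List[Row]]]:
    """Yield (held_out_disease, training_rows, test_rows) for each disease.

    Training rows are pooled from every disease except the held-out one,
    so the classifier never sees labeled examples of the "new" disease.
    """
    for held_out in sorted(datasets):
        train = [
            row
            for disease, rows in sorted(datasets.items())
            if disease != held_out
            for row in rows
        ]
        test = list(datasets[held_out])
        yield held_out, train, test


# Toy rows for illustration only; they are not data from the study.
toy = {
    "disease_A": [([1.0], 1), ([0.2], 0)],
    "disease_B": [([0.9], 1)],
    "disease_C": [([0.1], 0), ([0.8], 1), ([0.3], 0)],
}

splits = list(leave_one_disease_out(toy))
```

Each split can then be passed to any classifier. Comparing its score on the held-out disease with the score of a classifier trained and tested on that disease alone mirrors the comparison reported above.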
It is feasible to use classification approaches to automatically predict the relevance of a concept to a disease of interest. It is useful to combine features from disparate sources for the task of classification. Classifiers built from known diseases were generalizable to new diseases.