Wu Yonghui, Liu Mei, Zheng W Jim, Zhao Zhongming, Xu Hua
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37203, USA.
Pac Symp Biocomput. 2012:422-33.
Drug responses vary greatly among individuals because of human genetic variation; the study of this relationship is known as pharmacogenomics (PGx). Much PGx knowledge is embedded in the biomedical literature, and there is growing interest in developing text mining approaches to extract it. In this paper, we present a study that ranks candidate gene-drug relations using the Latent Dirichlet Allocation (LDA) model. Our approach consists of three steps: 1) recognize gene and drug entities in MEDLINE abstracts; 2) extract candidate gene-drug pairs based on co-occurrence at three levels: abstract, sentence, and phrase; and 3) rank the candidate pairs using several methods: term frequency, the Chi-square test, Mutual Information (MI), a previously reported Kullback-Leibler (KL) distance based on topics derived from LDA (LDA-KL), and a newly defined probabilistic KL distance based on LDA (LDA-PKL). We systematically evaluated these methods against a gold standard data set of gene-drug relations derived from PharmGKB. The proposed LDA-PKL method achieved a higher Mean Average Precision (MAP) than any of the other methods, suggesting that it is promising for ranking and detecting PGx relations.
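To make the baseline ranking and evaluation steps concrete, the following is a minimal Python sketch, not the authors' implementation. It scores co-occurring gene-drug pairs by raw frequency, a textbook 2x2 Chi-square statistic, and pointwise mutual information (used here as a stand-in for MI; the abstract does not give the exact formulation), then computes average precision against a gold standard. The function names, the input layout (one pair of gene and drug sets per co-occurrence unit), and the choice to average per query are all assumptions made for illustration. The LDA-KL and LDA-PKL distances are not sketched because the abstract does not define them.

```python
import math
from collections import Counter


def score_pairs(units):
    """Score gene-drug pairs from co-occurrence units.

    `units` is a list of (genes, drugs) tuples, one per unit of text
    (abstract, sentence, or phrase); this layout is an assumption.
    """
    n = len(units)
    gene_df, drug_df, pair_df = Counter(), Counter(), Counter()
    for genes, drugs in units:
        genes, drugs = set(genes), set(drugs)
        gene_df.update(genes)
        drug_df.update(drugs)
        pair_df.update((g, d) for g in genes for d in drugs)

    scores = {}
    for (g, d), n11 in pair_df.items():
        # 2x2 contingency table: gene present/absent vs. drug present/absent.
        n10 = gene_df[g] - n11   # gene without drug
        n01 = drug_df[d] - n11   # drug without gene
        n00 = n - n11 - n10 - n01
        den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
        chi2 = n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0
        # Pointwise mutual information (assumed stand-in for the paper's MI).
        pmi = math.log2(n11 * n / (gene_df[g] * drug_df[d]))
        scores[(g, d)] = {"freq": n11, "chi2": chi2, "pmi": pmi}
    return scores


def rank_by(scores, key):
    """Return pairs sorted by the chosen score, highest first."""
    return sorted(scores, key=lambda pair: scores[pair][key], reverse=True)


def average_precision(ranked, gold):
    """Average precision of one ranked list, normalised by len(gold)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def mean_average_precision(rankings, gold_by_query):
    """Mean of per-query average precision (queries are hypothetical,
    e.g. one ranked list per drug)."""
    aps = [average_precision(r, gold_by_query.get(q, set()))
           for q, r in rankings.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

In use, `rank_by(score_pairs(units), "chi2")` would give one candidate ranking, which can then be compared with the other scores through `average_precision` against a set of known gene-drug pairs such as those curated in PharmGKB.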