Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz019.
Gene-phenotype associations play an important role in understanding disease mechanisms, which is a prerequisite for developing treatments. Some gene-phenotype associations are established experimentally and made publicly available through standard resources such as MGI. However, a vast number of gene-phenotype associations remain buried in the biomedical literature. Given the volume of this literature, automated text mining tools are needed to ease the burden of manually curating gene-phenotype associations and to build comprehensive resources. In this study, we present an ontology-based approach, combined with statistical methods, to text mine gene-phenotype associations from the literature. Our method achieved AUC values of 0.90 and 0.75 in recovering known gene-phenotype associations from HPO and MGI, respectively. We posit that candidate genes and their associated diseases should be described with similar phenotypes in publications. We therefore demonstrate the utility of our approach by predicting disease candidate genes based on the semantic similarity of the phenotypes associated with genes and diseases. To the best of our knowledge, this is the first study to use an ontology-based approach to extract gene-phenotype associations from the literature. We evaluated our disease candidate prediction model on the gene-disease associations from MGI. Our model achieved AUC values of 0.90 and 0.87 on the OMIM (human) and MGI (mouse) datasets of gene-disease associations, respectively. Manual analysis of the text-mined data revealed that our method can accurately extract gene-phenotype associations that are not currently covered by existing public gene-phenotype resources. Overall, the results indicate that our method can precisely extract both known and new gene-phenotype associations from the literature. All data and methods are available at https://github.com/bio-ontology-research-group/genepheno.
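The candidate-gene prediction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which semantic similarity measure is used, so plain Jaccard overlap between phenotype sets stands in for an ontology-aware measure, and the gene names and phenotype identifiers (`geneA`, `PH:1`, and so on) are hypothetical toy values. The AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
def jaccard(a, b):
    """Jaccard overlap of two phenotype sets (a stand-in for semantic similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(disease_phenotypes, gene_phenotypes):
    """Return (gene, score) pairs sorted by similarity to the disease, best first."""
    scores = {
        gene: jaccard(phenos, disease_phenotypes)
        for gene, phenos in gene_phenotypes.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def auc(scored_genes, positives):
    """Probability that a known gene outscores an unrelated one (ties count 0.5)."""
    pos = [s for g, s in scored_genes if g in positives]
    neg = [s for g, s in scored_genes if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical text-mined phenotype annotations (toy identifiers, not real ontology IDs).
gene_phenotypes = {
    "geneA": {"PH:1", "PH:2", "PH:3"},
    "geneB": {"PH:3", "PH:4"},
    "geneC": {"PH:5"},
}
disease_phenotypes = {"PH:1", "PH:2", "PH:4"}

ranked = rank_genes(disease_phenotypes, gene_phenotypes)
print(ranked)
print(auc(ranked, positives={"geneA"}))
```

In practice a measure that exploits the ontology hierarchy would replace `jaccard`, so that related but non-identical phenotype terms also contribute to the score.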