DISEASES: text mining and data integration of disease-gene associations.

Author information

Pletscher-Frankild Sune, Pallejà Albert, Tsafou Kalliopi, Binder Janos X, Jensen Lars Juhl

Affiliations

Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark; Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

Publication information

Methods. 2015 Mar;74:83-9. doi: 10.1016/j.ymeth.2014.11.020. Epub 2014 Dec 5.

DOI:10.1016/j.ymeth.2014.11.020

PMID:25484339

Abstract

Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.
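The abstract describes dictionary-based tagging combined with a score that counts co-occurrences both within and between sentences. The following is a minimal sketch of that general idea, not the authors' implementation: the toy dictionaries, the naive sentence splitter, the function names, and the two weights (`w_sentence`, `w_document`) are all assumptions made for illustration, and the score here is a plain weighted count rather than the scoring scheme used by the DISEASES resource.

```python
import re
from collections import defaultdict

# Hypothetical toy dictionaries mapping a lowercase name to an identifier.
# A real tagger would use large curated dictionaries with synonyms.
GENES = {"brca1": "BRCA1", "tp53": "TP53"}
DISEASES = {"breast cancer": "breast cancer"}


def tag(text, dictionary):
    """Return the identifiers whose names occur as whole words in text."""
    lowered = text.lower()
    found = set()
    for name, identifier in dictionary.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.add(identifier)
    return found


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def score_abstracts(abstracts, w_sentence=1.0, w_document=0.25):
    """Sum weighted co-occurrence counts per (gene, disease) pair.

    Every abstract mentioning both entities adds w_document; if the two
    also share at least one sentence, it adds w_sentence on top.
    Both weights are illustrative assumptions.
    """
    scores = defaultdict(float)
    for abstract in abstracts:
        # Pairs that co-occur inside a single sentence of this abstract.
        same_sentence = set()
        for sentence in split_sentences(abstract):
            genes = tag(sentence, GENES)
            diseases = tag(sentence, DISEASES)
            same_sentence.update((g, d) for g in genes for d in diseases)

        # Pairs that co-occur anywhere in the abstract.
        for gene in tag(abstract, GENES):
            for disease in tag(abstract, DISEASES):
                scores[(gene, disease)] += w_document
                if (gene, disease) in same_sentence:
                    scores[(gene, disease)] += w_sentence
    return dict(scores)


# Invented example sentences, not taken from any real abstract.
EXAMPLE_ABSTRACTS = [
    "BRCA1 mutations raise the risk of breast cancer. TP53 was also sequenced.",
    "We studied TP53 in a patient cohort. Most patients had breast cancer.",
]

for pair, score in sorted(score_abstracts(EXAMPLE_ABSTRACTS).items()):
    print(pair, score)
```

In this toy run the BRCA1 pair scores higher than the TP53 pair because it shares a sentence with the disease name once, whereas TP53 only ever co-occurs with it across sentence boundaries. A production system would additionally normalize such counts against how often each gene and disease is mentioned overall.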
