Unsupervised text mining for assessing and augmenting GWAS results.

Authors

Ailem Melissa, Role François, Nadif Mohamed, Demenais Florence

Affiliations

LIPADE, Université Paris Descartes, Sorbonne Paris Cité, Paris F-75006, France.

INSERM, Genetic Variation and Human Diseases Unit, UMR-946, Paris F-75010, France; Institut Universitaire d'Hématologie, Université Paris Diderot, Sorbonne Paris Cité, Paris F-75010, France.

Publication information

J Biomed Inform. 2016 Apr;60:252-9. doi: 10.1016/j.jbi.2016.02.008. Epub 2016 Feb 19.

Abstract

Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply confirm hypothesized relationships between biological entities. We examine this in the context of genome-wide association studies (GWAS), an actively emerging field that has contributed to the identification of many genes associated with multifactorial diseases. These studies make it possible to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques, namely text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment GWAS results. We propose a generic framework, which we used to characterize the relationships between 10 genes reported to be associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value < 0.01). The clustering of observed and randomly selected genes also made it possible to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma.
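
The comparison of candidate genes against random genes described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the gene names (`GENE_A` and so on), the toy text snippets, the plain term-frequency weighting, the choice of sampling random sets from the full gene pool, and the helper names (`tf_vector`, `cosine`, `mean_pairwise_similarity`, `one_sided_p_value`) are all assumptions made for illustration. It only shows the general idea of computing cosine similarities between text-derived gene vectors and estimating a one-sided empirical p-value from randomly drawn gene sets.

```python
import math
import random
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Build a bag-of-words term-frequency vector (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pairwise_similarity(genes, vectors):
    """Average cosine similarity over all unordered pairs of genes."""
    pairs = list(combinations(genes, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def one_sided_p_value(candidates, vectors, n_draws=1000, seed=0):
    """Empirical one-sided p-value: fraction of random gene sets whose mean
    pairwise similarity is at least as high as that of the candidate set."""
    rng = random.Random(seed)
    observed = mean_pairwise_similarity(candidates, vectors)
    pool = sorted(vectors)  # assumption: random sets are drawn from all genes
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(pool, len(candidates))
        if mean_pairwise_similarity(sample, vectors) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_draws + 1)


# Hypothetical toy corpus: one short text per placeholder gene.
vectors = {
    "GENE_A": tf_vector("airway inflammation immune response cytokine"),
    "GENE_B": tf_vector("airway inflammation cytokine signaling"),
    "GENE_C": tf_vector("immune response cytokine airway"),
    "GENE_D": tf_vector("bone mineral density skeletal"),
    "GENE_E": tf_vector("retina photoreceptor vision"),
    "GENE_F": tf_vector("muscle contraction calcium"),
    "GENE_G": tf_vector("lipid metabolism liver"),
    "GENE_H": tf_vector("hair keratin follicle"),
}
candidates = ["GENE_A", "GENE_B", "GENE_C"]

observed, p_value = one_sided_p_value(candidates, vectors)
print(f"mean similarity = {observed:.3f}, one-sided p = {p_value:.4f}")
```

In a realistic setting the vectors would be built from the literature associated with each gene and the random pool would be far larger; the toy values printed here carry no biological meaning.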
