
Gene-disease association with literature based enrichment.

Authors

Tsafnat Guy, Jasch Dennis, Misra Agam, Choong Miew Keen, Lin Frank P-Y, Coiera Enrico

Affiliation

Centre for Health Informatics, University of New South Wales, Sydney, Australia.

Publication information

J Biomed Inform. 2014 Jun;49:221-6. doi: 10.1016/j.jbi.2014.03.007. Epub 2014 Mar 27.

DOI:10.1016/j.jbi.2014.03.007
PMID:24681202
Abstract

MOTIVATION

Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality.
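As background, the running-sum statistic at the core of GSEA can be sketched as follows. This is a simplified, unweighted illustration and not the GSEA-p implementation used in the study; the function name `enrichment_score` and the toy gene names are assumptions made for the example.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted (Kolmogorov-Smirnov style) enrichment score.

    ranked_genes: genes ordered from most to least differentially expressed.
    gene_set: set of genes sharing an annotation (for example, one catalog entry).
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a set member, step down otherwise; the walk ends at zero.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking: a set concentrated at the top scores near +1,
# a set concentrated at the bottom scores near -1.
toy_ranking = ["g1", "g2", "g3", "g4"]
top_score = enrichment_score(toy_ranking, {"g1", "g2"})
bottom_score = enrichment_score(toy_ranking, {"g3", "g4"})
```

The standard GSEA statistic additionally weights each hit by the gene's correlation with the phenotype and assesses significance by permutation; both are omitted here for brevity.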

METHODS

Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were generated automatically in two steps: (1) Entrez Gene was used to associate all recorded human genes with PubMed article IDs; (2) each gene mentioned in a PubMed article was associated with that article's MeSH terms (in CMeSH) and with its extracted UMLS concepts (in CUMLS). Microarray data for BrCa and CRC from the Gene Expression Omnibus were then annotated using CMeSH and CUMLS and, for comparison, with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database, MSigDB). Ranking was done using a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring the area under the receiver operating characteristic curve (AUC).
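The two-step catalog construction described above can be sketched as a join of two mappings. This is an illustrative sketch only: the function name `build_catalog`, the placeholder article IDs, and the toy gene-to-article and article-to-term assignments are invented for the example and are not data from the study.

```python
from collections import defaultdict


def build_catalog(gene_to_pmids, pmid_to_terms):
    """Invert gene -> articles -> terms into term -> set of genes.

    gene_to_pmids: dict mapping a gene symbol to the article IDs that mention it
                   (step 1, obtained from Entrez Gene in the study).
    pmid_to_terms: dict mapping an article ID to its annotation terms
                   (step 2, MeSH terms for CMeSH or UMLS concepts for CUMLS).
    """
    catalog = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            # Articles without annotations contribute nothing.
            for term in pmid_to_terms.get(pmid, ()):
                catalog[term].add(gene)
    return dict(catalog)


# Hypothetical toy inputs; the article IDs are placeholders, not real PMIDs.
toy_gene_to_pmids = {
    "BRCA1": ["article_1", "article_2"],
    "TP53": ["article_2", "article_3"],
    "APC": ["article_3"],
}
toy_pmid_to_terms = {
    "article_1": ["Breast Neoplasms"],
    "article_2": ["Breast Neoplasms", "DNA Repair"],
    "article_3": ["Colorectal Neoplasms"],
}
toy_catalog = build_catalog(toy_gene_to_pmids, toy_pmid_to_terms)
```

Each key of the resulting dictionary acts as one gene set, so the catalog can be passed to an enrichment tool in the same way as a pre-existing collection.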

RESULTS

Comparison of rankings produced with the literature enrichment catalogs, the pre-existing catalogs, and five randomly generated catalogs shows that the literature-derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71), but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing.
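To make the AUC figures concrete, the following sketch computes AUC for a scored gene list against a gold standard set, using the pairwise (Mann-Whitney) formulation. The abstract does not state how the authors computed AUC, so this is an assumption for illustration; the function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, gold_standard):
    """Area under the ROC curve for gene scores against a gold standard.

    scores: dict mapping gene -> score, higher meaning more disease-associated.
    gold_standard: set of genes considered truly associated with the disease.
    Equals the probability that a randomly chosen gold standard gene outscores
    a randomly chosen other gene, with ties counted as one half.
    """
    positives = [s for gene, s in scores.items() if gene in gold_standard]
    negatives = [s for gene, s in scores.items() if gene not in gold_standard]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative gene")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.3, "g4": 0.1}
toy_auc = auc(toy_scores, {"g1", "g3"})
```

Under this definition a value of 0.5 corresponds to random ordering, which is why the randomly generated catalogs are described as equivalent to guessing, and a value below 0.5 means gold standard genes tend to be ranked below the others.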

DISCUSSION

Gene set enrichment significantly improved prediction of gene-disease association. The choice of enrichment catalog had a substantial effect on prediction accuracy. The literature-based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date.

CONCLUSION

Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs.

AVAILABILITY

The literature-based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.

