Tsafnat Guy, Jasch Dennis, Misra Agam, Choong Miew Keen, Lin Frank P-Y, Coiera Enrico
Centre for Health Informatics, University of New South Wales, Sydney, Australia.
J Biomed Inform. 2014 Jun;49:221-6. doi: 10.1016/j.jbi.2014.03.007. Epub 2014 Mar 27.
Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality.
Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated in two steps: (1) Entrez Gene was used to associate all recorded human genes with PubMed article IDs; (2) the genes mentioned in each PubMed article were associated with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC were then annotated using CMeSH and CUMLS and, for comparison, with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database, MSigDB). Ranking was done using a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring the area under the receiver operating characteristic curve (AUC).
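The two-step catalog construction can be illustrated with a minimal sketch. It assumes the gene-to-article links and the article-to-term annotations are already available as plain dictionaries; the function name `build_catalog`, the gene and term labels, and the article IDs are hypothetical toy values, not data or code from the study.

```python
from collections import defaultdict


def build_catalog(gene_to_pmids, pmid_to_terms):
    """Invert gene -> article and article -> term links into term -> gene sets.

    gene_to_pmids: dict mapping a gene symbol to the article IDs that mention it.
    pmid_to_terms: dict mapping an article ID to its annotation terms
                   (MeSH terms for a CMeSH-like catalog, UMLS concepts for a
                   CUMLS-like catalog).
    """
    catalog = defaultdict(set)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            # Articles without annotations contribute nothing.
            for term in pmid_to_terms.get(pmid, ()):
                catalog[term].add(gene)
    return dict(catalog)


# Toy input (hypothetical labels, for illustration only).
gene_to_pmids = {"GENE_A": [1, 2], "GENE_B": [2], "GENE_C": [3]}
pmid_to_terms = {1: ["Term X"], 2: ["Term X", "Term Y"], 3: ["Term Y"]}

catalog = build_catalog(gene_to_pmids, pmid_to_terms)
for term in sorted(catalog):
    print(term, sorted(catalog[term]))
```

Each resulting entry is a gene set keyed by an annotation term, which is the shape a GSEA tool expects from an enrichment catalog.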
Comparison of rankings using the literature enrichment catalogs, the pre-existing catalogs and five randomly generated catalogs shows that the literature-derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing.
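To make the AUC figures concrete, the sketch below computes AUC for a ranked gene list against a gold standard set using the pairwise (Mann-Whitney) formulation: the fraction of gold/non-gold gene pairs in which the gold gene scores higher, with ties counted as half. This is an illustrative assumption about how such an evaluation can be done, not the study's own evaluation code; the function name `auc` and the gene scores are hypothetical.

```python
def auc(scores, gold):
    """AUC of gene scores against a gold standard set of gene symbols.

    scores: dict mapping gene symbol -> ranking score (higher = stronger
            predicted disease association).
    gold:   set of gene symbols known to be associated with the disease.
    """
    pos = [s for g, s in scores.items() if g in gold]
    neg = [s for g, s in scores.items() if g not in gold]
    if not pos or not neg:
        raise ValueError("need at least one gold and one non-gold gene")
    # Count pairs where the gold gene outranks the non-gold gene.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy scores (hypothetical): G1 and G3 are the gold standard genes.
toy_scores = {"G1": 0.9, "G2": 0.8, "G3": 0.4, "G4": 0.1}
toy_gold = {"G1", "G3"}
print(auc(toy_scores, toy_gold))  # 3 of 4 pairs are ordered correctly -> 0.75
```

Under this definition a value near 0.5 corresponds to random ordering, which is why the randomly generated catalogs are described as equivalent to random guessing.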
Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature-based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date.
Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs.
The literature-based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.