Ranking metrics in gene set enrichment analysis: do they matter?

Authors

Zyla Joanna, Marczyk Michal, Weiner January, Polanska Joanna

Affiliations

Data Mining Group, Institute of Automatic Control, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland.

Max Planck Institute for Infection Biology, Charitéplatz 1, Berlin, 10117, Germany.

Publication information

BMC Bioinformatics. 2017 May 12;18(1):256. doi: 10.1186/s12859-017-1674-0.

Abstract

BACKGROUND

Many methods exist for describing how gene expression changes in molecular pathways or gene ontologies under different experimental conditions. Among them, Gene Set Enrichment Analysis is one of the most commonly used (over 10,000 citations). An important parameter that can affect the final result is the choice of metric for ranking genes. Applying a default ranking metric may lead to poor results.

METHODS AND RESULTS

In this work, 28 benchmark data sets were used to evaluate the sensitivity and false positive rate of gene set analysis for 16 different ranking metrics, including new proposals. The robustness of the chosen methods to sample size was also tested. Using the k-means clustering algorithm, a group of four metrics with the highest performance in terms of overall sensitivity, overall false positive rate and computational load was established: the absolute value of the Moderated Welch Test statistic, the Minimum Significant Difference, the absolute value of the Signal-To-Noise ratio and the Baumgartner-Weiss-Schindler test statistic. For false positive rate estimation, all selected ranking metrics were robust with respect to sample size. For sensitivity, the absolute value of the Moderated Welch Test statistic and the absolute value of the Signal-To-Noise ratio gave stable results, while Baumgartner-Weiss-Schindler and Minimum Significant Difference showed better results for larger sample sizes. Finally, the Gene Set Enrichment Analysis method with all tested ranking metrics was parallelised and implemented in MATLAB, and is available at https://github.com/ZAEDPolSl/MrGSEA.
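To make the idea of a ranking metric concrete, the following is a minimal Python sketch of one of the four selected metrics, the absolute value of the Signal-To-Noise ratio. It assumes the commonly used definition |mean_A - mean_B| / (sd_A + sd_B) per gene and omits any variance adjustment; it is not taken from the authors' MATLAB implementation. The function names, gene names and expression values are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev


def abs_signal_to_noise(group_a, group_b):
    """Absolute signal-to-noise ratio for one gene.

    Assumed definition: |mean_a - mean_b| / (sd_a + sd_b),
    using sample standard deviations and no variance floor.
    """
    spread = stdev(group_a) + stdev(group_b)
    if spread == 0:
        raise ValueError("both groups have zero variance; metric is undefined")
    return abs(mean(group_a) - mean(group_b)) / spread


def rank_genes(expression, labels):
    """Rank genes from highest to lowest absolute signal-to-noise ratio.

    expression: dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 group labels, aligned with the sample order
    """
    scores = {}
    for gene, values in expression.items():
        group_a = [v for v, lab in zip(values, labels) if lab == 0]
        group_b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = abs_signal_to_noise(group_a, group_b)
    # The sorted list is what a gene set enrichment method would walk through.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three genes, three samples per group.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "GENE_B": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],
    "GENE_C": [4.0, 6.0, 8.0, 1.0, 3.0, 5.0],
}
ranked = rank_genes(expression, labels)
```

Because the absolute value is taken, strongly up-regulated and strongly down-regulated genes both end up at the top of the list, so the ranking reflects the magnitude of the difference rather than its direction.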

CONCLUSIONS

The choice of ranking metric in Gene Set Enrichment Analysis has a critical impact on the results of pathway enrichment analysis. The absolute value of the Moderated Welch Test statistic has the best overall sensitivity, and the Minimum Significant Difference has the best overall specificity of gene set analysis. When the number of non-normally distributed genes is high, using the Baumgartner-Weiss-Schindler test statistic gives better outcomes. It also finds more enriched pathways than the other tested metrics, which may lead to new biological discoveries.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b694/5427619/33c7e78fbdad/12859_2017_1674_Fig1_HTML.jpg
