To Tweak or Not to Tweak. How Exploiting Flexibilities in Gene Set Analysis Leads to Overoptimism.

Authors

Wünsch Milena, Sauer Christina, Herrmann Moritz, Hinske Ludwig Christian, Boulesteix Anne-Laure

Affiliations

Institute for Medical Information Processing, Biometry, and Epidemiology, Faculty of Medicine, LMU Munich, Munich, Germany.

Munich Center for Machine Learning, Munich, Germany.

Publication information

Biom J. 2025 Feb;67(1):e70016. doi: 10.1002/bimj.70016.

Abstract

Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of genes that show enriched expression patterns between two conditions. In addition to the multitude of methods available for this task, users are typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibility can lead to uncertainty about the "right" choice, further reinforced by a lack of evidence-based guidance. Especially when their statistical experience is scarce, this uncertainty might entice users to produce preferable results using a "trial-and-error" approach. While it may seem unproblematic at first glance, this practice can be viewed as a form of "cherry-picking" and cause an optimistic bias, rendering the results nonreplicable on independent data. After this problem has attracted a lot of attention in the context of classical hypothesis testing, we now aim to raise awareness of such overoptimism in the different and more complex context of gene set analyses. We mimic a hypothetical researcher who systematically selects the analysis variants yielding their preferred results, thereby considering three distinct goals they might pursue. Using a selection of popular gene set analysis methods, we tweak the results in this way for two frequently used benchmark gene expression data sets. Our study indicates that the potential for overoptimism is particularly high for a group of methods frequently used despite being commonly criticized. We conclude by providing practical recommendations to counter overoptimism in research findings in gene set analysis and beyond.
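The selection effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' procedure and does not implement any real gene set analysis method. It assumes pure-noise expression data (no true difference between the two conditions), a hypothetical "analysis variant" defined only by how many genes of the set are kept, and a simple two-sample z-test on the per-sample mean score. All function names, parameters, and set sizes are illustrative assumptions.

```python
import math
import random


def z_test_p(group_a, group_b, sd):
    """Two-sided p-value of a two-sample z-test with known per-sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    se = sd * math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(diff / se) / math.sqrt(2))


def variant_p_values(data_a, data_b, set_sizes):
    """One p-value per hypothetical variant (variant = number of genes kept)."""
    p_values = []
    for m in set_sizes:
        # Per-sample score: mean expression over the first m genes of the set.
        scores_a = [sum(sample[:m]) / m for sample in data_a]
        scores_b = [sum(sample[:m]) / m for sample in data_b]
        # The mean of m independent N(0, 1) values has SD 1 / sqrt(m).
        p_values.append(z_test_p(scores_a, scores_b, sd=1 / math.sqrt(m)))
    return p_values


def simulate(n_sims=2000, n_per_group=10, set_sizes=(5, 10, 20, 40),
             alpha=0.05, seed=1):
    """Return false-positive rates for a fixed variant and for the 'best' variant."""
    rng = random.Random(seed)
    n_genes = max(set_sizes)
    fixed_hits = 0
    tweaked_hits = 0
    for _ in range(n_sims):
        # Null data: both conditions are drawn from the same distribution.
        data_a = [[rng.gauss(0, 1) for _ in range(n_genes)]
                  for _ in range(n_per_group)]
        data_b = [[rng.gauss(0, 1) for _ in range(n_genes)]
                  for _ in range(n_per_group)]
        p_values = variant_p_values(data_a, data_b, set_sizes)
        fixed_hits += p_values[-1] < alpha    # variant chosen in advance
        tweaked_hits += min(p_values) < alpha  # variant chosen after seeing results
    return fixed_hits / n_sims, tweaked_hits / n_sims


fixed_rate, tweaked_rate = simulate()
print(f"pre-specified variant: {fixed_rate:.3f}")
print(f"best of all variants:  {tweaked_rate:.3f}")
```

Under these assumptions, the pre-specified variant should reject at roughly the nominal 5% level, while reporting the smallest p-value across variants rejects more often even though no variant is individually invalid. Real gene set analyses involve far more choices (preprocessing, gene ID conversion, method parameters), so this only conveys the mechanism, not its magnitude.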

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ced4/11656295/bc9664c229fb/BIMJ-67-e70016-g005.jpg
