
Two subtle problems with overrepresentation analysis.

Authors

Ziemann Mark, Schroeter Barry, Bora Anusuiya

Affiliations

Bioinformatics Working Group, Burnet Institute, Melbourne, VIC 3004, Australia.

School of Life and Environmental Sciences, Deakin University, Geelong, VIC 3216, Australia.

Publication information

Bioinform Adv. 2024 Oct 21;4(1):vbae159. doi: 10.1093/bioadv/vbae159. eCollection 2024.

DOI: 10.1093/bioadv/vbae159
PMID: 39539946
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11557902/
Abstract

MOTIVATION

Overrepresentation analysis (ORA) is used widely to assess the enrichment of functional categories in a gene list compared to a background list. ORA is therefore a critical method in the interpretation of 'omics data, relating gene lists to biological functions and themes. Although ORA is hugely popular, we and others have noticed two potentially undesired behaviours of some ORA tools. The first one we call the 'background problem', because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. The second one we call the 'false discovery rate problem', because some tools underestimate the true number of parallel tests conducted.
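The 'background problem' can be illustrated with a minimal sketch of the hypergeometric test that ORA tools commonly use. All counts below are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch also assumes that a tool which drops unannotated genes from the background drops them from the gene list too.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the gene set."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts (assumptions, not values from the paper)
detected_genes = 15000   # all genes detected in the experiment
annotated_genes = 9000   # detected genes that belong to at least one gene set
list_size = 500          # genes in the list of interest
list_annotated = 400     # genes in the list that belong to at least one gene set
set_size = 100           # size of the gene set being tested
overlap = 8              # genes in the list that are in the gene set

# Background = every detected gene
p_full = hypergeom_sf(overlap, detected_genes, set_size, list_size)

# Background = annotated genes only (unannotated genes discarded)
p_reduced = hypergeom_sf(overlap, annotated_genes, set_size, list_annotated)

print(f"full background:    p = {p_full:.4f}")
print(f"reduced background: p = {p_reduced:.4f}")
```

With these particular counts the reduced background gives a larger p-value, because the expected overlap rises from about 3.3 to about 4.4 genes. The direction and size of the shift depend on how the annotated fraction differs between the list and the background.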

RESULTS

Here, we demonstrate the impact of these issues on several real RNA-seq datasets and use simulated RNA-seq data to quantify the impact of these problems. We show that the severity of these problems depends on the gene set library, the number of genes in the list, and the degree of noise in the dataset. These problems can be mitigated by changing packages/websites for ORA or by changing to another approach such as functional class scoring.
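The 'false discovery rate problem' named in the Motivation can be sketched in the same way. The example assumes, as one possible mechanism, that a tool reports and corrects only the gene sets that share at least one gene with the list, while the library contains many more sets. The p-values and the library size are hypothetical, and `bh_adjust` is an illustrative helper rather than a function from any ORA package.

```python
def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests sets the number of tests used for the correction; sets that
    are not reported are treated as having p-values no smaller than the
    largest reported one.
    """
    m = len(pvalues) if n_tests is None else n_tests
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for the five gene sets a tool chose to report
reported_p = [0.0005, 0.004, 0.012, 0.03, 0.2]
library_size = 50  # hypothetical number of gene sets in the whole library

q_naive = bh_adjust(reported_p)                       # corrects for 5 tests
q_full = bh_adjust(reported_p, n_tests=library_size)  # corrects for 50 tests

print(sum(q < 0.05 for q in q_naive), "sets pass FDR < 0.05 with 5 tests")
print(sum(q < 0.05 for q in q_full), "sets pass FDR < 0.05 with 50 tests")
```

In this toy case four sets pass the threshold when only the reported sets are counted, and one passes when the correction uses the full library size.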

AVAILABILITY AND IMPLEMENTATION

An R/Shiny tool has been provided at https://oratool.ziemann-lab.net/ and the supporting materials are available from Zenodo (https://zenodo.org/records/13823301).

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3ae/11557902/e64507adf4b9/vbae159f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3ae/11557902/a7a0f3b4ceb2/vbae159f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3ae/11557902/5790ea3b5c43/vbae159f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3ae/11557902/86096304f049/vbae159f4.jpg
