Statistical significance of variables driving systematic variation in high-dimensional data.

Authors

Chung Neo Christopher, Storey John D

Affiliations

Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.

Publication information

Bioinformatics. 2015 Feb 15;31(4):545-54. doi: 10.1093/bioinformatics/btu674. Epub 2014 Oct 21.

DOI:10.1093/bioinformatics/btu674
PMID:25336500
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4325543/
Abstract

MOTIVATION

There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting.

RESULTS

We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
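
The abstract does not spell out the algorithm, so the following is only a rough sketch of one resampling scheme in the spirit of the problem described above: permute a handful of variables, recompute the PC from the altered data, and use the permuted variables' association statistics as an empirical null that already contains the over-fitting effect. Everything here is an assumption made for illustration (the function names, the squared-correlation statistic, the power-iteration PCA, the `+1` p-value correction, and the simulated data); it is not the implementation in the jackstraw R package.

```python
import random

def center_rows(X):
    """Subtract each row's mean (rows = variables, columns = samples)."""
    return [[v - sum(row) / len(row) for v in row] for row in X]

def top_pc(X, iters=50):
    """Approximate the top PC (a unit vector over samples) of row-centered X
    by power iteration on X^T X."""
    n = len(X[0])
    v = [1.0 / (j + 1) for j in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in X]            # u = X v
        w = [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(n)]  # w = X^T u
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            return v
        v = [x / norm for x in w]
    return v

def r2(row, pc):
    """Squared correlation between one variable and a unit-norm, zero-mean PC."""
    m = sum(row) / len(row)
    c = [x - m for x in row]
    ss = sum(x * x for x in c)
    if ss == 0:
        return 0.0
    dot = sum(a * b for a, b in zip(c, pc))
    return dot * dot / ss

def jackstraw_sketch(X, s=5, B=40, seed=0):
    """Return one empirical p-value per variable for association with the top PC."""
    rng = random.Random(seed)
    Xc = center_rows(X)
    pc = top_pc(Xc)
    observed = [r2(row, pc) for row in Xc]
    null = []
    for _ in range(B):
        Xstar = [row[:] for row in Xc]
        idx = rng.sample(range(len(Xc)), s)   # a few variables to make synthetic nulls
        for i in idx:
            rng.shuffle(Xstar[i])             # break their link to the samples
        pc_star = top_pc(Xstar)               # PC re-estimated, so over-fitting is included
        null.extend(r2(Xstar[i], pc_star) for i in idx)
    # Pooled empirical p-values; the +1 terms keep them strictly positive.
    return [(1 + sum(1 for z in null if z >= obs)) / (1 + len(null))
            for obs in observed]

def simulate(m=60, n=12, n_signal=10, seed=1):
    """Toy data: the first n_signal variables follow a two-group latent variable."""
    rng = random.Random(seed)
    latent = [-1.0] * (n // 2) + [1.0] * (n - n // 2)
    X = []
    for i in range(m):
        effect = 3.0 if i < n_signal else 0.0
        X.append([effect * latent[j] + rng.gauss(0, 1) for j in range(n)])
    return X

if __name__ == "__main__":
    p = jackstraw_sketch(simulate(), s=5, B=40, seed=2)
    print("mean p (signal variables):", sum(p[:10]) / 10)
    print("mean p (noise variables): ", sum(p[10:]) / 50)
```

In this toy setup only the permuted variables are altered in each round, so the recomputed PC stays close to the original one while the permuted variables supply null statistics computed the same way as the observed ones.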

AVAILABILITY AND IMPLEMENTATION

An R software package, called jackstraw, is available in CRAN.

CONTACT

jstorey@princeton.edu.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f492/4325543/18c84f384bb6/btu674f1p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f492/4325543/625c2ae8e2c5/btu674f2p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f492/4325543/9bedeec3d74c/btu674f3p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f492/4325543/df2805e619ed/btu674f4p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f492/4325543/b3e2f7501cb1/btu674f5p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f492/4325543/b807a45adb79/btu674f6p.jpg
