A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis.

Author affiliations

Department of Biostatistics, Biostatistics Shared Resource Core, VCU Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA; Division of Biomedical Statistics and Informatics and Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.

Publication information

Bioinformatics. 2013 Nov 15;29(22):2877-83. doi: 10.1093/bioinformatics/btt480. Epub 2013 Aug 19.

DOI: 10.1093/bioinformatics/btt480
PMID: 23958724
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3810845/

Abstract

MOTIVATION

Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.

RESULTS

We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies.
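The abstract does not spell out the statistic, so the following is only a minimal sketch under stated assumptions. It assumes that gPCA takes the singular value decomposition of Y'X (with Y a sample-by-batch indicator matrix and X the centered data) instead of X alone, that the statistic is the ratio of the variance of the data projected on the first guided loading to the variance projected on the first unguided loading, and that significance is assessed by permuting batch labels. The function names `gpca_delta` and `gpca_permutation_test`, the simulated data, and the add-one p-value correction are illustrative choices, not taken from the article or from the gPCA R package.

```python
import numpy as np


def gpca_delta(x, batch):
    """Variance along the first guided PC divided by variance along the first unguided PC."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0)  # center each feature (column)
    batch = np.asarray(batch)
    labels = np.unique(batch)
    # n x b indicator matrix: y[i, k] = 1 if sample i belongs to batch k
    y = (batch[:, None] == labels[None, :]).astype(float)
    # Unguided PCA: first right singular vector of X
    v_unguided = np.linalg.svd(x, full_matrices=False)[2][0]
    # Guided PCA (assumed form): first right singular vector of Y'X
    v_guided = np.linalg.svd(y.T @ x, full_matrices=False)[2][0]
    return np.var(x @ v_guided, ddof=1) / np.var(x @ v_unguided, ddof=1)


def gpca_permutation_test(x, batch, n_perm=200, seed=0):
    """Permutation p-value for the ratio statistic (add-one correction is an assumption)."""
    rng = np.random.default_rng(seed)
    batch = np.asarray(batch)
    observed = gpca_delta(x, batch)
    permuted = np.array(
        [gpca_delta(x, rng.permutation(batch)) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(permuted >= observed)) / (1 + n_perm)
    return observed, p_value


# Simulated example: 60 samples, 100 features, two batches of 30,
# with a shift of 3 added to the first 30 features of the second batch.
rng = np.random.default_rng(1)
data = rng.normal(size=(60, 100))
batch_labels = np.repeat([0, 1], 30)
data[batch_labels == 1, :30] += 3.0

delta, p_value = gpca_permutation_test(data, batch_labels)
print(f"delta = {delta:.3f}, permutation p-value = {p_value:.4f}")
```

Because the first unguided loading maximizes projected variance, the ratio cannot exceed 1 in this formulation; values near 1 indicate that the batch direction coincides with the dominant source of variation.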

CONCLUSION

We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.

AVAILABILITY AND IMPLEMENTATION

The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.

CONTACT

reesese@vcu.edu
