A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis.

Affiliations

Department of Biostatistics, Biostatistics Shared Resource Core, VCU Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA; Division of Biomedical Statistics and Informatics and Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.

Publication information

Bioinformatics. 2013 Nov 15;29(22):2877-83. doi: 10.1093/bioinformatics/btt480. Epub 2013 Aug 19.

DOI:10.1093/bioinformatics/btt480

PMID:23958724

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3810845/

Abstract

MOTIVATION

Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.
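The limitation described above can be illustrated with a small simulation. The sketch below is not taken from the article: the data, the effect sizes, and names such as `biology` and `pc_batch_corr` are assumptions chosen for illustration. It builds a dataset in which a strong non-batch signal dominates the variance and a weaker batch shift sits in other features, then checks how strongly each of the first two principal components tracks the batch labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 40, 20

# Two batches of equal size (hypothetical labels).
batch = np.repeat([0, 1], n_samples // 2)
# A strong non-batch signal that alternates between samples,
# so it is balanced within each batch and unrelated to batch.
biology = np.tile([1.0, -1.0], n_samples // 2)

# Low-level noise, a large "biological" signal in features 0-9,
# and a smaller batch shift in features 10-19.
data = rng.normal(scale=0.1, size=(n_samples, n_features))
data[:, :10] += 3.0 * biology[:, None]
data[:, 10:] += 1.0 * batch[:, None]

# Ordinary PCA via the SVD of the column-centered data.
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s  # principal component scores, one column per component

# Absolute correlation of PC1 and PC2 scores with the batch labels.
pc_batch_corr = [abs(np.corrcoef(scores[:, k], batch)[0, 1]) for k in range(2)]
print("|corr(PC1, batch)|, |corr(PC2, batch)|:", pc_batch_corr)
```

In this constructed case the first component follows the larger non-batch signal, and the batch shift only appears in the second component, so a plot of the first component alone would not reveal it.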

RESULTS

We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies.
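The abstract does not spell out the test statistic, so the following is only a minimal sketch under stated assumptions, not the authors' implementation or the API of the gPCA R package. It assumes the statistic is the ratio of the variance of the data projected onto the first guided component (the first right singular vector of Y'X, where Y is a batch indicator matrix and X is the centered data) to the variance projected onto the first ordinary principal component, with a p-value obtained by permuting the batch labels. The function names, the number of permutations, and the add-one correction in the p-value are all illustrative choices.

```python
import numpy as np

def projected_variance(x, target):
    """Variance of x projected onto the first right singular vector of target."""
    _, _, vt = np.linalg.svd(target, full_matrices=False)
    return (x @ vt[0]).var(ddof=1)

def gpca_delta(data, batch, n_perm=200, seed=0):
    """Assumed gPCA-style statistic with a permutation p-value (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)  # center each feature
    labels = np.asarray(batch)
    unguided = projected_variance(x, x)  # variance along the ordinary PC1

    def delta_for(lab):
        # Batch indicator matrix Y: one column per batch level.
        levels = np.unique(lab)
        y = (lab[:, None] == levels[None, :]).astype(float)
        # Guided direction comes from the SVD of Y'X.
        return projected_variance(x, y.T @ x) / unguided

    delta = delta_for(labels)
    permuted = np.array([delta_for(rng.permutation(labels)) for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly positive (assumption).
    p_value = (np.sum(permuted >= delta) + 1) / (n_perm + 1)
    return delta, p_value

# Simulated example: 60 samples, 50 features, batch shift in half the features.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 30)
data = rng.normal(size=(60, 50))
data[batch == 1, :25] += 2.0
delta, p_value = gpca_delta(data, batch)
print(f"delta = {delta:.3f}, permutation p-value = {p_value:.3f}")
```

Because the first ordinary principal component maximizes projected variance, this ratio cannot exceed 1; values near 1 indicate that the batch-guided direction captures nearly as much variance as the leading component.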

CONCLUSION

We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.

AVAILABILITY AND IMPLEMENTATION

The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article.

CONTACT

reesese@vcu.edu

