Statistical Genomics Group, Paul O'Gorman Building, UCL Cancer Institute, London WC1E 6BT, UK.
Bioinformatics. 2011 Jun 1;27(11):1496-505. doi: 10.1093/bioinformatics/btr171. Epub 2011 Apr 6.
A common difficulty in large-scale microarray studies is the presence of confounding factors, which may significantly skew estimates of statistical significance and lead to unreliable feature selection and high false negative rates. To deal with these difficulties, an algorithmic framework known as Surrogate Variable Analysis (SVA) was recently proposed.
Based on the notion that data can be viewed as an interference pattern, reflecting the superposition of independent effects and random noise, we present a modified SVA, called Independent Surrogate Variable Analysis (ISVA), to identify features correlating with a phenotype of interest in the presence of potential confounding factors. Using simulated data, we show that ISVA identifies confounders well and outperforms methods that do not adjust for confounding. Using four large-scale Illumina Infinium DNA methylation datasets subject to low signal-to-noise ratios and substantial confounding by beadchip effects and variable bisulfite conversion efficiency, we show that ISVA improves the identifiability of confounders and that this enables a framework for feature selection that is more robust to model misspecification and heterogeneous phenotypes. Finally, we demonstrate similar improvements of ISVA across four mRNA expression datasets. Thus, ISVA should be useful as a feature selection tool in studies that are subject to confounding.
An R package, isva, is available from www.cran.r-project.org.
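The general idea of surrogate-variable adjustment described above can be illustrated with a minimal Python sketch. It is not the isva package and not the authors' algorithm: for brevity it uses an SVD of the residual matrix where ISVA would use independent component analysis, and the function name, simulated data, effect sizes, and choice of one surrogate are all assumptions made for illustration. Because the surrogates are taken from residuals that are already orthogonal to the phenotype, this simplified version only removes excess variance; it does not correct for confounders that are correlated with the phenotype.

```python
import numpy as np

def phenotype_tstats(data, phenotype, n_surrogates):
    """Per-feature t-statistics for the phenotype, adjusted for surrogates.

    data: features x samples matrix; phenotype: vector of length samples.
    Hypothetical helper for illustration; SVD stands in for ICA here.
    """
    n_samples = data.shape[1]
    # Step 1: regress intercept + phenotype out of every feature.
    design = np.column_stack([np.ones(n_samples), phenotype])
    coef, *_ = np.linalg.lstsq(design, data.T, rcond=None)
    residuals = data.T - design @ coef            # samples x features
    # Step 2: dominant patterns of residual variation act as surrogates.
    u, _, _ = np.linalg.svd(residuals, full_matrices=False)
    surrogates = u[:, :n_surrogates]              # samples x n_surrogates
    # Step 3: refit each feature with phenotype and surrogates as covariates.
    full = np.column_stack([design, surrogates])
    beta, *_ = np.linalg.lstsq(full, data.T, rcond=None)
    resid = data.T - full @ beta
    dof = n_samples - full.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof       # residual variance per feature
    xtx_inv = np.linalg.inv(full.T @ full)
    se = np.sqrt(sigma2 * xtx_inv[1, 1])          # standard error of phenotype term
    return beta[1] / se, surrogates

# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_features, n_samples = 500, 40
phenotype = np.repeat([0.0, 1.0], n_samples // 2)   # two groups of 20
batch = np.tile([0.0, 1.0], n_samples // 2)         # hidden confounder
data = rng.normal(size=(n_features, n_samples))
data[:50] += 1.0 * phenotype                        # true phenotype features
data[:300] += 3.0 * batch                           # batch-affected features

t_unadjusted, _ = phenotype_tstats(data, phenotype, 0)
t_adjusted, surrogates = phenotype_tstats(data, phenotype, 1)
batch_corr = abs(np.corrcoef(surrogates[:, 0], batch)[0, 1])
print("correlation of surrogate with hidden batch:", round(batch_corr, 3))
print("mean t, true features, unadjusted:", t_unadjusted[:50].mean())
print("mean t, true features, adjusted:  ", t_adjusted[:50].mean())
```

In this simulation the first surrogate tracks the hidden batch variable, and including it as a covariate raises the phenotype t-statistics of the truly associated features relative to the unadjusted fit.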