Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, USA.
Proc Natl Acad Sci U S A. 2010 Sep 21;107(38):16465-70. doi: 10.1073/pnas.1002425107. Epub 2010 Sep 1.
Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. One route to such an understanding is to determine which parts of our DNA, such as single-nucleotide polymorphisms (SNPs), affect particular intermediary processes such as gene expression. Naively, such associations can be identified by applying a simple statistical test to every pairing of a genetic variant with a gene transcript. However, a wide variety of confounders lie hidden in the data and, if not properly addressed, lead to both spurious and missed associations. We present a statistical model that jointly corrects for two particular kinds of hidden structure when these confounders are unknown: population structure (e.g., race, family relatedness) and microarray expression artifacts (e.g., batch effects). Applying our method to real and synthetic data from both human and mouse, we demonstrate the need for such a joint correction of confounders, as well as the disadvantages of alternative approaches based on those in the current literature. In particular, we show that our class of models has maximum power to detect expression quantitative trait loci (eQTL) on synthetic data and has the best performance on a bronze standard applied to real data. Our software and the associations we found with it are available at http://www.microsoft.com/science.
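To make the naive baseline mentioned in the abstract concrete, the following is a minimal sketch of an all-pairs scan: every variant is tested against every transcript with a simple correlation test. It is not the joint-correction model the paper presents, and it makes no attempt to handle population structure or batch effects. The function names, the toy data generator, and the choice of a Pearson correlation with a Fisher z normal approximation for the p-value are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def naive_eqtl_scan(genotypes, expression):
    """Test every (variant, transcript) pair; return hits sorted by p-value.

    genotypes:  dict mapping variant name -> list of allele counts (0/1/2)
    expression: dict mapping transcript name -> list of expression levels
    """
    results = []
    for snp, g in genotypes.items():
        for gene, e in expression.items():
            r = pearson_r(g, e)
            results.append((snp, gene, r, approx_p_value(r, len(g))))
    return sorted(results, key=lambda hit: hit[3])


def make_toy_data(n=200, seed=0):
    """Hypothetical toy data: snp1 drives geneA; everything else is noise."""
    rng = random.Random(seed)
    snp1 = [rng.choice([0, 1, 2]) for _ in range(n)]
    snp2 = [rng.choice([0, 1, 2]) for _ in range(n)]
    gene_a = [1.0 * g + rng.gauss(0, 1) for g in snp1]
    gene_b = [rng.gauss(0, 1) for _ in range(n)]
    return {"snp1": snp1, "snp2": snp2}, {"geneA": gene_a, "geneB": gene_b}


if __name__ == "__main__":
    genotypes, expression = make_toy_data()
    for snp, gene, r, p in naive_eqtl_scan(genotypes, expression):
        print(f"{snp}\t{gene}\tr={r:+.3f}\tp={p:.2e}")
```

On clean toy data like this, the planted pair should rank first. The abstract's point is that real data are not clean: hidden population structure or batch effects can induce correlations that such a scan would report as associations, or mask true ones, which is what motivates a joint correction.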