The Nuffield Division of Clinical Laboratory Sciences.
The Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN UK.
Bioinformatics. 2017 Oct 1;33(19):3058-3064. doi: 10.1093/bioinformatics/btx355.
The identification of genetic variants influencing gene expression (known as expression quantitative trait loci or eQTLs) is important in unravelling the genetic basis of complex traits. Detecting multiple eQTLs simultaneously in a population based on paired DNA-seq and RNA-seq assays employs two competing types of models: transformation-based models, which rely on appropriate transformations of RNA-seq data (and are supported by a mature mathematical theory), and count-based models, which represent digital gene expression explicitly, thus rendering such transformations unnecessary. The latter constitute an immensely popular methodology, which is, however, plagued by mathematical intractability.
We develop tractable count-based models, which are amenable to efficient estimation through the introduction of latent variables and the appropriate application of recent statistical theory in a sparse Bayesian modelling framework. Furthermore, we examine several transformation methods for RNA-seq read counts and we introduce arcsin, logit and Laplace smoothing as preprocessing steps for transformation-based models. Using natural and carefully simulated data from the 1000 Genomes and gEUVADIS projects, we benchmark both approaches under a variety of scenarios, including the presence of noise and violation of basic model assumptions. We demonstrate that an arcsin transformation of Laplace-smoothed data is at least as good as state-of-the-art models, particularly at small sample sizes. Furthermore, we show that an over-dispersed Poisson model is comparable to the celebrated Negative Binomial model, but much easier to estimate. These results provide strong support for transformation-based over count-based (particularly Negative-Binomial-based) models for eQTL mapping.
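The preprocessing steps named above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the following are assumptions: counts are stored as a genes-by-samples array, Laplace smoothing adds a pseudocount `alpha` to every count before converting to per-sample proportions, and the arcsin step is the usual arcsine-square-root transform. The function names are hypothetical and are not taken from eQTLseq.

```python
import numpy as np


def laplace_smooth(counts, alpha=1.0):
    """Convert a genes x samples count matrix to smoothed proportions.

    Assumed form (not confirmed by the abstract):
    p_ij = (c_ij + alpha) / (N_j + alpha * G),
    where N_j is the total count in sample j and G is the number of genes.
    """
    counts = np.asarray(counts, dtype=float)
    n_genes = counts.shape[0]
    totals = counts.sum(axis=0, keepdims=True)
    return (counts + alpha) / (totals + alpha * n_genes)


def arcsin_transform(props):
    """Arcsine-square-root transform of proportions in (0, 1)."""
    return np.arcsin(np.sqrt(props))


def logit_transform(props):
    """Log-odds transform of proportions in (0, 1)."""
    return np.log(props) - np.log1p(-props)


# Toy read counts: 3 genes (rows) x 3 samples (columns), including zeros.
counts = np.array([[0, 3, 10],
                   [5, 0, 20],
                   [95, 97, 70]])

props = laplace_smooth(counts)       # strictly positive, columns sum to 1
z_arcsin = arcsin_transform(props)   # values in (0, pi/2)
z_logit = logit_transform(props)     # finite, because no proportion is 0 or 1
```

The pseudocount keeps every proportion strictly between 0 and 1, so zero counts do not produce infinite values under the logit transform. The transformed matrices could then be passed to a Normal-theory regression model.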
All methods are implemented in the free software eQTLseq: https://github.com/dvav/eQTLseq.
dimitris.vavoulis@well.ox.ac.uk.
Supplementary data are available at Bioinformatics online.