Data Science, Radboud University, Institute for Computing and Information Sciences, Nijmegen, The Netherlands.
Molecular Developmental Biology, Radboud University, Research Institute for Molecular Life Sciences, Nijmegen, The Netherlands.
PLoS One. 2020 May 1;15(5):e0231824. doi: 10.1371/journal.pone.0231824. eCollection 2020.
Cellular identity and behavior are controlled by complex gene regulatory networks. Transcription factors (TFs) bind to specific DNA sequences to regulate the transcription of their target genes. On the basis of these TF motifs in cis-regulatory elements, we can model the influence of TFs on gene expression. Such models of TF motif activity usually assume a linear relationship between the motif activity and the gene expression level. A commonly used method to model motif influence is based on Ridge Regression. One important assumption of linear regression is independence between samples. However, if samples are generated from the same cell line, tissue, or other biological source, this assumption may be invalid. The same assumption of independence is also applied to different yet similar experimental conditions, where it may also be inappropriate. In theory, wrongly assuming independence between samples could lead to a loss in signal detection. Here we investigate whether a Bayesian model that allows for correlations results in more accurate inference of motif activities.
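As an illustration of the linear motif activity model described above, the following minimal sketch estimates motif activities by Ridge Regression in closed form. The names `M` (genes x motifs score matrix), `Y` (genes x samples expression matrix), `lam` (penalty strength), and the simulated data are assumptions made for this example and are not taken from the project implementation.

```python
import numpy as np

def ridge_motif_activity(M, Y, lam=1.0):
    """Closed-form ridge estimate of motif activities.

    M   : (genes x motifs) motif score matrix (hypothetical input)
    Y   : (genes x samples) expression matrix (hypothetical input)
    lam : ridge penalty; each sample is fitted independently
    Returns a (motifs x samples) activity matrix.
    """
    n_motifs = M.shape[1]
    # Solve (M^T M + lam * I) A = M^T Y for A
    return np.linalg.solve(M.T @ M + lam * np.eye(n_motifs), M.T @ Y)

# Simulated toy data: 500 genes, 20 motifs, 6 samples
rng = np.random.default_rng(0)
M = rng.normal(size=(500, 20))
A_true = rng.normal(size=(20, 6))
Y = M @ A_true + 0.1 * rng.normal(size=(500, 6))

A_hat = ridge_motif_activity(M, Y, lam=1.0)
print(A_hat.shape)
```

Because the penalty and the noise model treat each sample column separately, this estimator ignores any correlation between samples, which is the assumption questioned here.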
We extend Ridge Regression to a Bayesian Linear Mixed Model, which allows us to model dependence between different samples. In a simulation study, we investigate the differences between the two model assumptions. We show that our Bayesian Linear Mixed Model implementation outperforms Ridge Regression in a simulation scenario where the noise, that is, the signal that cannot be explained by TF motifs, is uncorrelated. However, we demonstrate that there is no such gain in performance if the noise has a covariance structure over samples similar to that of the signal that can be explained by motifs. We give a mathematical explanation of why this is the case. Using four representative real datasets, we show that at most ~40% of the signal is explained by motifs using the linear model. With these data there is no advantage to using the Bayesian Linear Mixed Model, due to the similarity of the covariance structure.
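One way to see why correlated noise can remove the advantage is the following sketch. It assumes a simplified matrix-normal setup chosen for this example, not necessarily the authors' derivation: motif activities have covariance `V` across samples, noise has covariance `Sigma` across samples, and the variances `sigma2` and `sigma2_a` are known. Under these assumptions, the posterior mean coincides with the ridge estimate (penalty `sigma2 / sigma2_a`) when `Sigma` equals `V`, and differs from it when the noise is uncorrelated. All function and variable names are hypothetical.

```python
import numpy as np

def ridge(M, Y, lam):
    """Ridge estimate, fitted independently per sample."""
    K = M.shape[1]
    return np.linalg.solve(M.T @ M + lam * np.eye(K), M.T @ Y)

def posterior_mean(M, Y, V, Sigma, sigma2=1.0, sigma2_a=1.0):
    """Posterior mean of activities A in Y = M A + E (assumed model).

    Rows of A ~ N(0, sigma2_a * V), rows of E ~ N(0, sigma2 * Sigma),
    where V and Sigma are (samples x samples) covariance matrices.
    """
    K = M.shape[1]
    C = Y.shape[1]
    Sigma_inv = np.linalg.inv(Sigma)
    V_inv = np.linalg.inv(V)
    # Posterior precision of vec(A), with columns of A stacked
    precision = (np.kron(Sigma_inv, M.T @ M) / sigma2
                 + np.kron(V_inv, np.eye(K)) / sigma2_a)
    rhs = (M.T @ Y @ Sigma_inv) / sigma2
    vec_a = np.linalg.solve(precision, rhs.flatten(order="F"))
    return vec_a.reshape((K, C), order="F")

# Simulated toy data: 50 genes, 5 motifs, 4 correlated samples
rng = np.random.default_rng(1)
G, K, C = 50, 5, 4
idx = np.arange(C)
V = 0.8 ** np.abs(idx[:, None] - idx[None, :])  # sample covariance
L = np.linalg.cholesky(V)
M = rng.normal(size=(G, K))
A_true = rng.normal(size=(K, C)) @ L.T          # correlated activities
Y = M @ A_true + rng.normal(size=(G, C)) @ L.T  # noise shares V

A_ridge = ridge(M, Y, lam=1.0)
A_same = posterior_mean(M, Y, V, Sigma=V)           # noise covariance = V
A_indep = posterior_mean(M, Y, V, Sigma=np.eye(C))  # uncorrelated noise

print(np.allclose(A_ridge, A_same))   # estimates coincide
print(np.allclose(A_ridge, A_indep))  # estimates differ
```

In this setup the sample covariance factors out of the posterior when `Sigma` is proportional to `V`, so modeling the correlation changes nothing about the point estimate. The Kronecker-product solve is used only for clarity and would not scale to realistic numbers of motifs and samples.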
AVAILABILITY & IMPLEMENTATION: The project implementation is available at https://github.com/Sim19/SimGEXPwMotifs.