Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada.
Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada.
PLoS One. 2016 Oct 4;11(10):e0163595. doi: 10.1371/journal.pone.0163595. eCollection 2016.
Evaluating the similarity of different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities' measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both the measured signal levels and the uncertainty in those levels, which arises from varying sequencing depth in different experiments and from varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low, especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities' signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on two RNA-seq datasets and one miRNA-seq dataset.
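The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' published method. It assumes that each entity's share of reads in each experiment has a Beta posterior under a Beta(alpha, beta) prior, and that the correlation denominator is inflated by the average posterior variance, so that uncertain (low-count) measurements are pulled toward zero. The function names, the parameters `alpha` and `beta`, and the variance decomposition are all assumptions made for illustration. The default alpha = beta = 1 is a uniform prior, which the abstract warns can bias estimates when sequencing depths vary widely; the two alternative priors proposed in the paper are not reproduced here.

```python
import numpy as np


def pearson_on_fractions(counts):
    """Ordinary Pearson correlation between rows of depth-normalized counts."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=0)          # total reads per experiment (column)
    return np.corrcoef(counts / depth)  # rows are entities


def bayesian_correlation_sketch(counts, alpha=1.0, beta=1.0):
    """Uncertainty-aware correlation between rows of a count matrix.

    counts: (n_entities, n_experiments) array of read counts.
    alpha, beta: hypothetical Beta prior parameters (assumption of this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    n_experiments = counts.shape[1]
    depth = counts.sum(axis=0)

    # Beta posterior for each entity's fraction of reads in each experiment.
    a = counts + alpha
    b = depth - counts + beta
    post_mean = a / (a + b)
    post_var = a * b / ((a + b) ** 2 * (a + b + 1.0))

    # Covariance of posterior means across experiments.
    centered = post_mean - post_mean.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / n_experiments

    # Total variance = variance of posterior means + mean posterior variance.
    # The second term grows relative to the first when counts are low,
    # which shrinks the resulting correlation.
    total_var = (centered ** 2).mean(axis=1) + post_var.mean(axis=1)

    # Note: in this sketch the diagonal is below 1 for the same reason.
    return cov / np.sqrt(np.outer(total_var, total_var))


if __name__ == "__main__":
    # Toy data: two high-count entities, two low-count entities, and a
    # background row that brings every experiment to one million reads.
    toy = np.array([
        [1000, 2000, 1000, 2000],
        [1500, 3000, 1500, 3000],
        [0, 1, 0, 1],
        [0, 1, 0, 1],
        [997500, 994998, 997500, 994998],
    ])
    print("Pearson:\n", np.round(pearson_on_fractions(toy), 3))
    print("Sketch:\n", np.round(bayesian_correlation_sketch(toy), 3))
```

On the toy data, the two low-count rows have identical profiles and therefore a perfect Pearson correlation, while the sketch assigns them a much smaller value because their posterior variance dominates; the two high-count rows remain strongly correlated under both measures.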