Ganjali Mojtaba, Baghfalaki Taban, Berridge Damon
School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran; Department of Statistics, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran.
School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran; Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran.
PLoS One. 2015 Apr 24;10(4):e0123791. doi: 10.1371/journal.pone.0123791. eCollection 2015.
This paper addresses the identification of differentially expressed genes across conditions from gene expression microarray data in the presence of outliers. To this end, gene expression data are modeled robustly using the normal/independent family of distributions. This family includes the normal and Student's t distributions, which have been used previously, as well as extensions such as the slash, contaminated normal and Laplace distributions. Differentially expressed genes are identified under these distributional assumptions rather than under the normal distribution alone. Parameters are estimated with a Bayesian approach using Markov chain Monte Carlo (MCMC) methods. Two publicly available gene expression data sets are analyzed using the proposed approach, and the use of the robust models for detecting differentially expressed genes is investigated. The investigation shows that the choice of model matters considerably when detecting differential expression, because each gene has only a small number of replicates and the data contain outliers. The models are compared using several statistical criteria and the ROC curve, and the method is further illustrated with simulation studies. The results demonstrate the flexibility of these robust models in identifying differentially expressed genes.
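To make the normal/independent family concrete, the following minimal sketch draws samples using the standard scale-mixture-of-normals representation, y = mu + sigma * s * z with z standard normal and s a random scale independent of z. It is an illustration only, not the authors' implementation: the function name `sample_ni`, the parameter names (`df`, `prob`, `factor`) and the default values are assumptions made for this example, and the abstract does not specify the parameterization used in the paper.

```python
import numpy as np


def sample_ni(dist, size, mu=0.0, sigma=1.0, df=4.0, prob=0.1, factor=0.1, rng=None):
    """Draw from a normal/independent distribution (hypothetical helper).

    y = mu + sigma * s * z, where z ~ N(0, 1) and s is a random scale
    drawn independently of z. The choice of s determines the family member.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(size)
    if dist == "normal":
        # Degenerate mixing: the scale is always 1.
        s = np.ones(size)
    elif dist == "t":
        # u ~ Gamma(df/2, rate df/2), s = u**-0.5 gives Student's t with df degrees of freedom.
        u = rng.gamma(shape=df / 2.0, scale=2.0 / df, size=size)
        s = 1.0 / np.sqrt(u)
    elif dist == "slash":
        # u ~ Beta(df, 1), generated as v**(1/df) with v uniform on (0, 1].
        v = 1.0 - rng.random(size)
        s = 1.0 / np.sqrt(v ** (1.0 / df))
    elif dist == "contaminated":
        # With probability `prob` the precision is shrunk by `factor` (0 < factor < 1).
        u = np.where(rng.random(size) < prob, factor, 1.0)
        s = 1.0 / np.sqrt(u)
    elif dist == "laplace":
        # Variance w ~ Exponential(mean 2) gives a Laplace distribution with unit scale.
        s = np.sqrt(rng.exponential(scale=2.0, size=size))
    else:
        raise ValueError(f"unknown distribution: {dist}")
    return mu + sigma * s * z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name in ["normal", "t", "slash", "contaminated", "laplace"]:
        y = sample_ni(name, 100_000, rng=rng)
        # Fraction of draws beyond 3 units: larger values indicate heavier tails.
        print(name, float(np.mean(np.abs(y) > 3.0)))
```

Compared with the normal case, the heavy-tailed members place noticeably more mass far from the center, which is why an outlying replicate has less influence on estimates under these models.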