Wei Yingying, Tenzen Toyoaki, Ji Hongkai
Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA.
Department of Statistics, The Chinese University of Hong Kong, Shatin NT, Hong Kong.
Center for Regenerative Medicine, Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA 02114, USA.
Biostatistics. 2015 Jan;16(1):31-46. doi: 10.1093/biostatistics/kxu038. Epub 2014 Aug 19.
The standard methods for detecting differential gene expression are mostly designed for analyzing a single gene expression experiment. When data from multiple related gene expression studies are available, separately analyzing each study is not ideal as it may fail to detect important genes with consistent but relatively weak differential signals in multiple studies. Jointly modeling all data allows one to borrow information across studies to improve the analysis. However, a simple concordance model, in which each gene is assumed to be differential in either all studies or none of the studies, is incapable of handling genes with study-specific differential expression. In contrast, a model that naively enumerates and analyzes all possible differential patterns across studies can deal with study-specificity and allow information pooling, but the complexity of its parameter space grows exponentially as the number of studies increases. Here, we propose a correlation motif approach to address this dilemma. This approach searches for a small number of latent probability vectors called correlation motifs to capture the major correlation patterns among multiple studies. The motifs provide the basis for sharing information among studies and genes. The approach has flexibility to handle all possible study-specific differential patterns. It improves detection of differential expression and overcomes the barrier of exponential model complexity.
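The complexity argument in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so the parameterization here is an assumption made for illustration only: each of K motifs is taken to be a vector of per-study probabilities of differential expression with a mixture weight, and studies are treated as independent given the motif. The function names are hypothetical and are not taken from the authors' software.

```python
from itertools import product


def full_enumeration_param_count(num_studies: int) -> int:
    """Free parameters when every on/off pattern across studies gets its own probability."""
    # 2^D patterns whose probabilities sum to one leave 2^D - 1 free parameters.
    return 2 ** num_studies - 1


def motif_param_count(num_studies: int, num_motifs: int) -> int:
    """Free parameters under the assumed motif mixture (illustrative parameterization)."""
    # K motif vectors of length D, plus K - 1 free mixture weights.
    return num_motifs * num_studies + (num_motifs - 1)


def pattern_probability(pattern, weights, motifs) -> float:
    """Probability of one differential pattern under the assumed mixture.

    pattern: tuple of 0/1 flags, one per study (1 = differential).
    weights: mixture weight of each motif (should sum to one).
    motifs:  one probability vector per motif, one entry per study.
    """
    total = 0.0
    for weight, motif in zip(weights, motifs):
        prob = weight
        for status, q in zip(pattern, motif):
            # Studies are assumed independent given the motif.
            prob *= q if status else 1.0 - q
        total += prob
    return total


if __name__ == "__main__":
    # Growth of the two parameter counts with the number of studies (K fixed at 3).
    for d in (2, 4, 8, 16):
        print(d, full_enumeration_param_count(d), motif_param_count(d, 3))

    # Made-up motifs for 4 studies: mostly null, concordant, and specific to studies 1-2.
    weights = [0.7, 0.2, 0.1]
    motifs = [
        [0.05, 0.05, 0.05, 0.05],
        [0.90, 0.90, 0.90, 0.90],
        [0.90, 0.90, 0.10, 0.10],
    ]
    probs = {p: pattern_probability(p, weights, motifs) for p in product((0, 1), repeat=4)}
    print(sum(probs.values()))  # should be close to 1 (up to floating-point rounding)
```

Under these assumptions, the motif count grows linearly in the number of studies while the enumeration count grows exponentially, and as long as motif entries lie strictly between 0 and 1, every one of the 2^D patterns still receives nonzero probability. This is one way to read the abstract's claim that the approach can handle all study-specific patterns.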