Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark.
PLoS One. 2013 Sep 25;8(9):e72116. doi: 10.1371/journal.pone.0072116. eCollection 2013.
The abundance of high-dimensional measurements in the form of gene expression and mass spectrometry data calls for models to elucidate the underlying biological system. For widely studied organisms like yeast, it is possible to incorporate prior knowledge from a variety of databases, an approach used in several recent studies. However, if such information is not available for a particular organism, these methods fall short. In this paper we propose a statistical method that is applicable to a dataset consisting of Liquid Chromatography-Mass Spectrometry (LC-MS) and gene expression (DNA microarray) measurements from the same samples, to identify genes controlling the production of metabolites. Due to the high dimensionality of both LC-MS and DNA microarray data, dimension reduction and variable selection are key elements of the analysis. Our proposed approach starts by identifying the basis functions ("building blocks") that constitute the output from a mass spectrometry experiment. Subsequently, the weights of these basis functions are related to the observations from the corresponding gene expression data in order to identify which genes are associated with specific patterns seen in the metabolite data. The modeling framework is flexible and computationally fast, and it can accommodate treatment effects and other variables related to the experimental design. We demonstrate that within the proposed framework, genes regulating the production of specific metabolites can be identified correctly unless the variation in the noise is more than twice that of the signal.
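The two-step idea described above (extract basis functions from the spectra, then relate their weights to gene expression) can be illustrated with a small toy sketch. The abstract does not specify how the basis functions are estimated or how variables are selected, so this sketch substitutes two simple stand-ins: the first principal direction of the centered spectra (computed by power iteration) as a single "basis function", and marginal correlation ranking as a crude form of variable selection. All function names, the simulated data, and the parameter values are hypothetical and are not taken from the paper.

```python
import math
import random


def center_columns(X):
    """Subtract the per-column mean from a samples x features matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def leading_basis(X, iters=100):
    """Stand-in for basis extraction (assumption): first principal direction
    of the centered spectra via power iteration, plus per-sample weights."""
    Xc = center_columns(X)
    p = len(Xc[0])
    # Start from the centered spectrum with the largest norm.
    v = max(Xc, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in Xc]
        w = [sum(s * row[j] for s, row in zip(scores, Xc)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    weights = [sum(a * b for a, b in zip(row, v)) for row in Xc]
    return v, weights


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_genes(expr, weights):
    """Stand-in for variable selection (assumption): order gene indices by
    absolute correlation between expression and the basis weights."""
    cols = list(zip(*expr))
    scores = [abs(pearson(col, weights)) for col in cols]
    return sorted(range(len(cols)), key=lambda g: -scores[g])


def simulate(n_samples=40, n_genes=30, n_mz=50, causal=7, noise_sd=0.5, seed=1):
    """Toy data: one gene scales a single Gaussian-shaped peak in the
    (already centered) spectra; everything else is Gaussian noise."""
    rng = random.Random(seed)
    expr = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
    peak = [math.exp(-0.5 * ((j - 20) / 3.0) ** 2) for j in range(n_mz)]
    spectra = [
        [expr[i][causal] * peak[j] + rng.gauss(0, noise_sd) for j in range(n_mz)]
        for i in range(n_samples)
    ]
    return expr, spectra


if __name__ == "__main__":
    expr, spectra = simulate()
    basis, weights = leading_basis(spectra)
    print("top-ranked genes:", rank_genes(expr, weights)[:3])
```

In this toy setting the signal is strong relative to the noise, so the simulated causal gene should rank first. Raising `noise_sd` gives a rough feel for how identification degrades as noise grows, although this sketch does not reproduce the paper's noise-to-signal analysis.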