Department of Medical Statistics, and Department of Nephrology and Rheumatology, University Medical Center Göttingen, Göttingen 37099, Germany.
Bioinformatics. 2014 May 15;30(10):1424-30. doi: 10.1093/bioinformatics/btu062. Epub 2014 Jan 30.
Global test procedures are frequently used in gene expression analysis to study the relationship between a functional subset of RNA transcripts and an experimental group factor. However, these procedures have rarely been used to analyze high-throughput data from other sources, such as proteome expression data. The main difficulties in transferring global test procedures from genomics to proteomics data are the more complicated way of obtaining functional annotations and the handling of missing values in some types of proteomics data.
We propose a simple mixed linear model in combination with a permutation procedure and missing-value imputation to conduct global tests in proteomics experiments. This new approach is motivated by protein expression data obtained by means of 2-D gel electrophoresis in a mouse experiment from our current research. A simulation study showed that the power and test level of the mixed model alone can be affected by missing values in the dataset. Imputation of missing values was able to correct for a bias in some simulation settings. Our new approach makes it possible to rank Gene Ontology (GO) terms associated with protein sets. It is also helpful when a specific protein is represented by multiple spots on a 2-D gel, because these spots can likewise be treated as a protein set. Analysis of our data points to correlations between the deficiency of the protein 'calreticulin' and protein sets related to biological processes in the heart muscle.
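The general idea of a permutation-based global test on a protein set with imputed missing values can be illustrated with a minimal sketch. This is not the mixed linear model proposed here and not the API of 'RepeatedHighDim'. As simplifying assumptions, it uses per-protein mean imputation and the sum of squared group-mean differences as the global statistic, and all function names (`impute_row_means`, `global_statistic`, `permutation_global_test`) are hypothetical.

```python
import numpy as np


def impute_row_means(x):
    """Replace NaNs in each row (protein or spot) by that row's observed mean.

    Assumption: simple mean imputation; rows that are entirely missing stay NaN.
    """
    x = np.array(x, dtype=float)
    row_means = np.nanmean(x, axis=1, keepdims=True)
    return np.where(np.isnan(x), row_means, x)


def global_statistic(x, groups):
    """Sum over proteins of the squared difference in group means (0 vs 1)."""
    diff = x[:, groups == 1].mean(axis=1) - x[:, groups == 0].mean(axis=1)
    return float(np.sum(diff ** 2))


def permutation_global_test(x, groups, n_perm=999, seed=0):
    """Permutation p-value for a group effect on the whole protein set.

    x      : array of shape (proteins, samples), may contain NaN
    groups : array of 0/1 group labels, one per sample
    """
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    x = impute_row_means(x)
    observed = global_statistic(x, groups)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle the group labels to mimic the null hypothesis of no group effect.
        permuted = rng.permutation(groups)
        if global_statistic(x, permuted) >= observed:
            exceed += 1
    # Add-one correction so that the p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

Applying such a function to each protein set in turn (for example, the proteins annotated to one GO term, or the spots belonging to one protein) and sorting the resulting p-values would give a ranking in the spirit described above.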
Our proposed approach is included in the R-package 'RepeatedHighDim', which already contains a global test procedure for gene expression data. The package can be retrieved from http://cran.r-project.org/.