Vuong Huy, Shedden Kerby, Liu Yashu, Lubman David M
Bioinformatics Program, University of Michigan, MI, USA.
J Proteomics Bioinform. 2011 Jun 18;4(6):116-122. doi: 10.4172/jpb.1000177.
An active area in cancer biomarker research is the development of statistical methods to identify expression signatures reflecting the heterogeneity of cancer across affected individuals. Tomlins et al. [5] observed heterogeneous patterns of oncogene activation within several cancer types, and introduced a statistical method called Cancer Outlier Profile Analysis (COPA) to identify "cancer outlier genes". Several related statistical approaches have since been developed, but the operating characteristics of these procedures (e.g., power and false positive rate) have not yet been fully characterized, especially in a proteomics setting. Here, we use simulation to identify how strongly an outlier pattern of differential expression must hold for outlier-based approaches to be more effective than mean-based approaches. We also propose a diagnostic procedure that characterizes the potentially unequal levels of differential expression in the tails and in the center of a distribution of expression values. We find that for sample sizes and effect sizes typical of proteomics studies, the outlier pattern must be strong for outlier-based analysis to provide a meaningful benefit. This is corroborated by analysis of proteomics data from a melanoma study, in which differential expression is most often present throughout the distribution rather than concentrated in the tails, although a few proteins show patterns consistent with outlier expression.
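To make the comparison described in the abstract concrete, the following is a minimal sketch of how power for an outlier-based statistic and a mean-based statistic might be compared by simulation. It is not the authors' code. The COPA-style score here follows the commonly described form (median-center, scale by the median absolute deviation, then take an upper percentile of the case values), and the choice of percentile, the normal data model, the sample size, the shift size, and the function names `copa_statistic`, `welch_t`, `simulate`, and `estimate_power` are all illustrative assumptions.

```python
import numpy as np


def copa_statistic(controls, cases, q=90):
    """COPA-style score: q-th percentile of case values after robust scaling.

    Centering and scaling use the pooled median and MAD (assumed form).
    """
    pooled = np.concatenate([controls, cases])
    med = np.median(pooled)
    mad = 1.4826 * np.median(np.abs(pooled - med))
    return np.percentile((cases - med) / mad, q)


def welch_t(controls, cases):
    """Welch t statistic, used here as the mean-based comparison."""
    se = np.sqrt(cases.var(ddof=1) / cases.size
                 + controls.var(ddof=1) / controls.size)
    return (cases.mean() - controls.mean()) / se


def simulate(n, frac_shifted, shift, rng):
    """Draw n controls and n cases; only a fraction of cases is shifted up."""
    controls = rng.normal(size=n)
    cases = rng.normal(size=n)
    k = int(round(frac_shifted * n))
    cases[:k] += shift  # outlier pattern: a subset of cases over-expresses
    return controls, cases


def estimate_power(stat, frac_shifted, shift, n=20, n_sim=500,
                   alpha=0.05, seed=0):
    """One-sided power, with the rejection threshold taken from a simulated null."""
    rng = np.random.default_rng(seed)
    null = np.array([stat(*simulate(n, 0.0, 0.0, rng)) for _ in range(n_sim)])
    threshold = np.quantile(null, 1 - alpha)
    alt = np.array([stat(*simulate(n, frac_shifted, shift, rng))
                    for _ in range(n_sim)])
    return float(np.mean(alt > threshold))


if __name__ == "__main__":
    # Vary the fraction of shifted cases to see where each statistic does better.
    for frac in (0.1, 0.25, 0.5, 1.0):
        p_copa = estimate_power(copa_statistic, frac, shift=2.0)
        p_t = estimate_power(welch_t, frac, shift=2.0)
        print(f"fraction shifted={frac:.2f}  COPA power={p_copa:.2f}  t power={p_t:.2f}")
```

Calibrating the threshold from a simulated null keeps the two statistics comparable at the same nominal false positive rate, which is what a power comparison of this kind needs. The specific settings above are placeholders and would have to be matched to the sample sizes and effect sizes of the study being modeled.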