Department of Integrative Oncology, British Columbia Cancer Agency Research Centre, Vancouver, BC, Canada.
BMC Med Genomics. 2010 Aug 3;3:32. doi: 10.1186/1755-8794-3-32.
An important consideration when analyzing both microarray and quantitative PCR expression data is the selection of appropriate genes as endogenous controls or reference genes. This step is especially critical when identifying genes differentially expressed between datasets. Moreover, reference genes suitable in one context (e.g. lung cancer) may not be suitable in another (e.g. breast cancer). Currently, the main approach to identifying reference genes involves mining expression microarray data for transcripts that are highly expressed and relatively constant across a sample set. A caveat is that microarray data must be normalized prior to analysis, and the measurements obtained are relative, not absolute. In contrast, sequencing-based technologies provide digital quantitative output, which allows absolute quantification and makes reference gene identification more accurate.
Serial analysis of gene expression (SAGE) profiles of non-malignant and malignant lung samples were compared using a permutation test to identify the most stably expressed genes across all samples. Subsequently, the specificity of the reference genes was evaluated across multiple tissue types, their constancy of expression was assessed using quantitative RT-PCR (qPCR), and their impact on differential expression analysis of microarray data was evaluated.
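As an illustration of the kind of screen described above, the following is a minimal sketch, not the authors' actual procedure, whose details the abstract does not give. It assumes that a gene counts as stable when a permutation test on the difference in group means finds no evidence of a shift between non-malignant and malignant samples, and that the remaining genes are ranked by coefficient of variation. The function names, the `alpha` threshold, the gene labels, and all expression values are hypothetical.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign samples to the two groups.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def rank_stable_genes(non_malignant, malignant, alpha=0.05):
    """Return genes with no detectable group shift, least variable first.

    Both arguments map a gene label to a list of normalized tag counts
    (assumed here to be tags per million, one value per sample).
    """
    candidates = []
    for gene in non_malignant:
        p = permutation_p_value(non_malignant[gene], malignant[gene])
        if p > alpha:  # assumption: no evidence of a shift means "stable"
            cv = coefficient_of_variation(non_malignant[gene] + malignant[gene])
            candidates.append((cv, gene))
    return [gene for _, gene in sorted(candidates)]


# Hypothetical values, invented purely for illustration.
non_malignant = {
    "GENE_STABLE": [100, 102, 98, 101, 99],
    "GENE_SHIFTED": [50, 55, 52, 48, 51],
}
malignant = {
    "GENE_STABLE": [101, 99, 100, 102, 98],
    "GENE_SHIFTED": [150, 160, 155, 148, 152],
}

stable = rank_stable_genes(non_malignant, malignant)
print(stable)
```

A real analysis would also need a multiple-testing strategy and a minimum expression threshold, neither of which is shown here.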
We show that (i) conventional reference genes such as ACTB and GAPDH are highly variable between cancerous and non-cancerous samples, (ii) reference genes identified for lung cancer do not perform well for other cancer types (breast and brain), (iii) reference genes identified through SAGE show low variability using qPCR in a different cohort of samples, and (iv) normalization of a lung cancer gene expression microarray dataset with or without our reference genes yields different results for differential gene expression and subsequent analyses. Specifically, key established pathways in lung cancer exhibit higher statistical significance in a dataset normalized with our reference genes than in the same dataset normalized without them.
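To make point (iv) concrete, here is one simple way a set of reference genes can be used to normalize expression data. This is a sketch under stated assumptions, not the normalization the authors applied: each sample is rescaled so that the geometric mean of its reference genes matches the average across samples. The function names, sample labels, gene labels, and intensities are all hypothetical.

```python
import math


def reference_scale_factors(samples, reference_genes):
    """Per-sample factor that equalizes the geometric mean of the reference genes."""
    log_means = {}
    for name, expression in samples.items():
        logs = [math.log2(expression[gene]) for gene in reference_genes]
        log_means[name] = sum(logs) / len(logs)  # log2 of the geometric mean
    target = sum(log_means.values()) / len(log_means)
    return {name: 2 ** (target - value) for name, value in log_means.items()}


def normalize_to_references(samples, reference_genes):
    """Rescale every gene in every sample by that sample's reference factor."""
    factors = reference_scale_factors(samples, reference_genes)
    return {
        name: {gene: value * factors[name] for gene, value in expression.items()}
        for name, expression in samples.items()
    }


# Hypothetical intensities; sample_2 was loaded at twice the amount of sample_1,
# and TARGET is additionally twofold higher in sample_2.
samples = {
    "sample_1": {"REF_1": 64.0, "REF_2": 256.0, "TARGET": 32.0},
    "sample_2": {"REF_1": 128.0, "REF_2": 512.0, "TARGET": 128.0},
}

normalized = normalize_to_references(samples, ["REF_1", "REF_2"])
fold_change = normalized["sample_2"]["TARGET"] / normalized["sample_1"]["TARGET"]
print(fold_change)
```

After rescaling, the reference genes agree between the two samples, and the remaining difference in `TARGET` reflects only the change that is not explained by loading.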
Our analyses found that NDUFA1, RPL19, RAB5C, and RPS18 occupy the top-ranking positions among 15 reference genes suitable for normalization of lung tissue expression data. Importantly, the approach used in this study can be applied to data generated on next-generation sequencing platforms to identify the reference genes best suited to other contexts.