School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK.
BMC Bioinformatics. 2010 Jan 11;11:19. doi: 10.1186/1471-2105-11-19.
Theme-driven cancer survival studies address whether the expression signature of genes related to a biological process can predict patient survival time. Although this should ideally be achieved by testing two separate null hypotheses, current methods treat both hypotheses as one. The first test should assess whether a geneset, independent of its composition, is associated with prognosis (frequently done with a survival test). The second test then verifies whether the theme of the geneset is relevant (usually done with an empirical test that compares the geneset of interest with random genesets). Current methods do not test this second null hypothesis because it has been assumed that the distribution of p-values for random genesets (when tested against the first null hypothesis) is uniform. Here we demonstrate that this assumption is generally incorrect and that, consequently, such methods may erroneously associate the biology of a particular geneset with cancer prognosis.
To assess the impact of non-uniform distributions for random genesets in such studies, an automated theme-driven method was developed. This method empirically approximates the p-value distribution of sets of unrelated genes based on a permutation approach, and tests whether predefined sets of biologically related genes are associated with survival. A comparison with a published theme-driven approach revealed non-uniform distributions, suggesting that the original study has a significant problem with false positive rates. When applied to two public cancer datasets, our technique revealed novel ontological categories with prognostic power, including significant correlations of "fatty acid metabolism" with overall survival in breast cancer, and of "receptor mediated endocytosis", "brain development", "apical plasma membrane" and "MAPK signaling pathway" with overall survival in lung cancer.
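The second test described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the parameters, the add-one correction and the choice to draw random genesets of the same size as the themed geneset are all assumptions made for illustration. The caller supplies `survival_pvalue`, a hypothetical callable that applies whatever survival test is used for the first null hypothesis to a genes-by-patients expression sub-matrix.

```python
import numpy as np


def empirical_theme_pvalue(expr, geneset_idx, survival_pvalue,
                           n_random=1000, seed=0):
    """Compare a themed geneset against random genesets of equal size.

    expr            : 2-D array, rows are genes, columns are patients.
    geneset_idx     : row indices of the geneset of interest.
    survival_pvalue : hypothetical callable; maps a (genes x patients)
                      sub-matrix to a p-value from a survival test
                      (the first null hypothesis).
    Returns (observed p-value, empirical p-value, random p-values).
    """
    rng = np.random.default_rng(seed)
    geneset_idx = np.asarray(geneset_idx)
    n_genes = expr.shape[0]
    size = len(geneset_idx)

    # First test: is the geneset, whatever its composition, prognostic?
    observed = survival_pvalue(expr[geneset_idx])

    # Approximate the p-value distribution of unrelated genes by drawing
    # random genesets of the same size (an assumption of this sketch).
    random_ps = np.array([
        survival_pvalue(expr[rng.choice(n_genes, size=size, replace=False)])
        for _ in range(n_random)
    ])

    # Second test: fraction of random genesets that do at least as well.
    # The add-one correction keeps the estimate strictly above zero.
    empirical = (1 + np.sum(random_ps <= observed)) / (1 + n_random)
    return observed, empirical, random_ps
```

If the random p-values were uniform, the empirical p-value would track the observed one closely; a marked gap between the two is the situation in which relying on the first test alone could mislead.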
Current methods for theme-driven survival studies assume uniformity of p-values for random genesets, which can lead to false conclusions. Our approach corrects for this pitfall and provides a novel route to identifying higher-level biological themes and pathways with prognostic power in clinical microarray datasets.