The Genome Institute, Washington University, St Louis, MO 63108, USA.
Bioinformatics. 2011 Jun 15;27(12):1595-602. doi: 10.1093/bioinformatics/btr193. Epub 2011 Apr 14.
The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate.
Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher-Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each.
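The two design points above can be illustrated with a minimal Python sketch. It is not the PathScan implementation (which is in Perl); the function names, the uniform per-base background rate `rate`, and the example gene lengths are assumptions made purely for illustration. The sketch computes the exact distribution of the number of mutated genes in a pathway by convolving one Bernoulli distribution per gene, whereas the abstract describes an approximation to this convolution. It then combines per-sample p-values with the plain Fisher statistic, omitting Lancaster's correction for discrete p-values.

```python
import math

def gene_mutation_prob(length_bp, rate):
    """P(gene carries >= 1 mutation) under random mutation at a uniform per-base rate."""
    return 1.0 - (1.0 - rate) ** length_bp

def mutated_gene_count_pmf(probs):
    """Exact pmf of the number of mutated genes, built by convolving
    one Bernoulli(p_i) distribution per gene (a Poisson-binomial pmf)."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this gene is not mutated
            nxt[k + 1] += mass * p       # this gene is mutated
        pmf = nxt
    return pmf

def sample_p_value(gene_lengths, n_mutated, rate):
    """Upper-tail p-value: P(at least n_mutated genes mutated) in one sample."""
    probs = [gene_mutation_prob(length, rate) for length in gene_lengths]
    pmf = mutated_gene_count_pmf(probs)
    return min(1.0, sum(pmf[n_mutated:]))

def fisher_combine(p_values, floor=1e-300):
    """Combine independent p-values with Fisher's method.
    X = -2 * sum(ln p) is chi-square with 2n degrees of freedom under the null;
    for even degrees of freedom the tail has the closed form
    exp(-X/2) * sum_{j<n} (X/2)^j / j!.
    Lancaster's discrete-data correction is NOT applied here."""
    half_x = -sum(math.log(max(p, floor)) for p in p_values)
    term, total = 1.0, 0.0
    for j in range(len(p_values)):
        total += term
        term *= half_x / (j + 1)
    return math.exp(-half_x) * total

if __name__ == "__main__":
    lengths = [1200, 5400, 900, 30000]   # hypothetical coding lengths in bp
    rate = 1e-6                          # hypothetical background rate per base
    observed = [1, 0, 2]                 # hypothetical mutated-gene counts in 3 samples
    per_sample = [sample_p_value(lengths, k, rate) for k in observed]
    print(per_sample, fisher_combine(per_sample))
```

Because each gene gets its own probability, a long gene contributes more null probability mass than a short one, which is the gene-length effect the test is designed to capture.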
As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies.
PathScan is implemented in Perl and is available from the Genome Institute at: http://genome.wustl.edu/software/pathscan.