Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America.
PLoS One. 2012;7(5):e37595. doi: 10.1371/journal.pone.0037595. Epub 2012 May 18.
BACKGROUND: Pathway analysis of a gene set is an important part of large-scale omics data analysis. However, applying traditional pathway enrichment methods to next-generation sequencing (NGS) data is prone to several potential biases, including genomic/genetic factors (e.g., the particular disease and gene length) and environmental factors (e.g., personal lifestyle and the frequency and dosage of exposure to mutagens). Novel methods are therefore urgently needed for these new data types, especially for individual-specific genome data.
METHODOLOGY: In this study, we propose a novel method for pathway analysis of NGS mutation data that explicitly takes the gene-wise mutation rate into account. We estimated the gene-wise mutation rate from the individual-specific background mutation rate together with the gene length. Using this mutation rate as a weight for each gene, our weighted resampling strategy builds the null distribution for each pathway while matching the gene length patterns. The resulting empirical P value provides an adjusted statistical evaluation.
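The weighted resampling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight formula, so the weight used here (the probability that a gene of a given length carries at least one mutation under a per-base background rate) is an assumption, as are the function names `gene_weights` and `empirical_pathway_p` and the add-one form of the empirical P value. The sketch also does not reproduce any further gene length pattern matching beyond the weights themselves.

```python
import numpy as np


def gene_weights(gene_lengths, background_rate):
    """Assumed weight: P(at least one mutation) for a gene of the given length,
    given a per-base, individual-specific background mutation rate."""
    lengths = np.asarray(gene_lengths, dtype=float)
    return 1.0 - (1.0 - background_rate) ** lengths


def empirical_pathway_p(gene_lengths, mutated_idx, pathway_idx,
                        background_rate, n_resamples=2000, seed=0):
    """Empirical P value for the number of mutated genes in one pathway.

    Each resample draws as many genes as were observed mutated, without
    replacement, with probability proportional to the gene weight, so long
    genes are drawn more often under the null.
    """
    rng = np.random.default_rng(seed)
    weights = gene_weights(gene_lengths, background_rate)
    probs = weights / weights.sum()

    in_pathway = np.zeros(len(probs), dtype=bool)
    in_pathway[list(pathway_idx)] = True

    mutated_idx = np.asarray(list(mutated_idx))
    observed = int(in_pathway[mutated_idx].sum())

    hits = 0
    for _ in range(n_resamples):
        draw = rng.choice(len(probs), size=len(mutated_idx),
                          replace=False, p=probs)
        if in_pathway[draw].sum() >= observed:
            hits += 1

    # Add-one correction (an assumption) keeps the P value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)
```

For example, with 200 genes of equal length, a five-gene pathway, and ten mutated genes that include all five pathway genes, `empirical_pathway_p([1000] * 200, range(10), range(5), 1e-6)` returns an observed count of 5 and a small empirical P value. Making the pathway genes much longer than the rest raises their weights, so the same overlap becomes less surprising under the null.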
PRINCIPAL FINDINGS/CONCLUSIONS: We applied our weighted resampling method to a lung adenocarcinoma dataset and a glioblastoma dataset, and compared it with other widely used methods. By explicitly adjusting for gene length, the weighted resampling method performs as well as the standard methods for significant pathways with strong evidence. Importantly, our method could effectively reject many marginally significant pathways detected by standard methods, including several cancer-unrelated pathways composed largely of long genes. We further demonstrated that, with such biases reduced, pathway crosstalk within each individual and pathway co-mutation maps across multiple individuals can be objectively explored and evaluated. This method performs pathway analysis in a sample-centered fashion and provides an alternative approach to the accurate analysis of personalized cancer genomes. It can be extended to other types of genomic data (genotyping and methylation) that have similar bias problems.