Wang Y X Rachel, Waterman Michael S, Huang Haiyan
Department of Statistics, University of California, Berkeley, CA 94720
Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089
Proc Natl Acad Sci U S A. 2014 Nov 18;111(46):16371-6. doi: 10.1073/pnas.1417128111. Epub 2014 Oct 6.
With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed and is still widely used for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks from gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, patterns of gene association may well change across samples or exist only in a subset of them, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of the distributions and power of both statistics, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.
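The abstract does not define either statistic, so the following is only an illustrative sketch of the general idea of counting local rank patterns, not the authors' method. The function names, the pattern length `k`, the choice to count exactly reversed patterns as matches (to capture negative association), and the tie-breaking rule are all assumptions made for this example. The first function looks at contiguous windows, loosely in the spirit of the time-course setting; the second looks at all size-`k` subsets of samples.

```python
from itertools import combinations


def rank_pattern(values):
    """Return the positions of `values` in increasing order of value.

    Ties are broken by position (an assumption of this sketch).
    """
    return tuple(sorted(range(len(values)), key=lambda i: values[i]))


def _patterns_agree(xs, ys):
    """True if xs and ys have the same rank pattern or exactly reversed ones."""
    px, py = rank_pattern(xs), rank_pattern(ys)
    return px == py or px == py[::-1]


def count_matching_windows(x, y, k):
    """Count contiguous length-k windows where x and y share a rank pattern."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(
        _patterns_agree(x[s:s + k], y[s:s + k])
        for s in range(len(x) - k + 1)
    )


def count_matching_subsets(x, y, k):
    """Count size-k subsets of sample indices where x and y share a rank pattern.

    This enumerates all C(n, k) subsets, so it is only practical for small n and k.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(
        _patterns_agree([x[i] for i in idx], [y[i] for i in idx])
        for idx in combinations(range(len(x)), k)
    )
```

Because only the ordering of values within each window or subset matters, a single extreme value changes the count far less than it would change a correlation coefficient, which is consistent with the robustness to outliers claimed in the abstract. Turning such a count into a test would require a null distribution, which this sketch does not attempt.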