Oldham Michael C, Langfelder Peter, Horvath Steve
Department of Neurology, The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, USA.
BMC Syst Biol. 2012 Jun 12;6:63. doi: 10.1186/1752-0509-6-63.
Genomic datasets generated by new technologies are increasingly prevalent in disparate areas of biological research. While many studies have sought to characterize relationships among genomic features, commensurate efforts to characterize relationships among biological samples have been less common. Consequently, the full extent of sample variation in genomic studies is often under-appreciated, complicating downstream analytical tasks such as gene co-expression network analysis.
Here we demonstrate the use of network methods for characterizing sample relationships in microarray data generated from human brain tissue. We describe an approach for identifying outlying samples that does not depend on the choice or use of clustering algorithms. We introduce a battery of measures for quantifying the consistency and integrity of sample relationships, which can be compared across disparate studies, technology platforms, and biological systems. Among these measures, we provide evidence that the correlation between the connectivity and the clustering coefficient (two important network concepts) is a sensitive indicator of homogeneity among biological samples. We also show that this measure, which we refer to as cor(K,C), can distinguish biologically meaningful relationships among subgroups of samples. Specifically, we find that cor(K,C) reveals the profound effect of Huntington's disease on samples from the caudate nucleus relative to other brain regions. Furthermore, we find that this effect is concentrated in specific modules of genes that are naturally co-expressed in human caudate nucleus, highlighting a new strategy for exploring the effects of disease on sets of genes.
These results underscore the importance of systematically exploring sample relationships in large genomic datasets before seeking to analyze genomic feature activity. We introduce a standardized platform for this purpose using freely available R software that has been designed to enable iterative and interactive exploration of sample networks.
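The cor(K,C) measure described above can be illustrated with a minimal sketch. The abstract does not say how the sample network is built, so the construction here is an assumption: a signed, weighted adjacency derived from pairwise Pearson correlations between samples, raised to a power `beta`, together with the usual weighted generalization of the clustering coefficient. The function names and the default `beta=2.0` are hypothetical and are not taken from the authors' R software.

```python
import numpy as np


def signed_adjacency(expr, beta=2.0):
    """Build a sample-by-sample adjacency matrix.

    expr is a (features x samples) array. The signed transform
    ((1 + r) / 2) ** beta is an assumed choice, not stated in the abstract.
    """
    corr = np.corrcoef(expr, rowvar=False)  # samples x samples correlations
    adj = ((1.0 + corr) / 2.0) ** beta      # map [-1, 1] into [0, 1]
    np.fill_diagonal(adj, 0.0)              # no self-connections
    return adj


def connectivity_and_clustering(adj):
    """Return connectivity K and weighted clustering coefficient C per sample."""
    k = adj.sum(axis=1)                      # K_i = sum_j a_ij
    # With a zero diagonal, diag(A^3)_i = sum over j != l of a_ij * a_jl * a_li.
    triangles = np.diagonal(adj @ adj @ adj)
    # (sum_j a_ij)^2 - sum_j a_ij^2 = sum over j != l of a_ij * a_il.
    pairs = k ** 2 - (adj ** 2).sum(axis=1)
    c = triangles / pairs
    return k, c


def cor_kc(expr, beta=2.0):
    """Pearson correlation between connectivity and clustering coefficient."""
    k, c = connectivity_and_clustering(signed_adjacency(expr, beta))
    return float(np.corrcoef(k, c)[0, 1])
```

Under these assumptions, calling `cor_kc(expr)` on a features-by-samples expression matrix returns one number for that group of samples, which can then be compared across subgroups such as brain regions or disease states.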