Emory University, Atlanta.
IEEE/ACM Trans Comput Biol Bioinform. 2013 Jul-Aug;10(4):1080-5. doi: 10.1109/TCBB.2013.99.
High-throughput expression technologies, such as gene expression arrays and liquid chromatography-mass spectrometry (LC-MS), measure thousands of features (genes or metabolites) on a continuous scale. In such data, both linear and nonlinear relations exist between features. Nonlinear relations can reflect critical regulation patterns in the biological system, but traditional clustering methods based on linear associations neither identify nor use them. Clustering based on general dependence, that is, on both linear and nonlinear relations, is hampered by the high dimensionality and high noise level of the data. We developed a sensitive nonparametric measure of general dependence between random variables, or groups of random variables, in high dimensions. Based on this dependence measure, we developed a hierarchical clustering method. In simulation studies, the method outperformed correlation-based and mutual information (MI)-based hierarchical clustering in clustering features with nonlinear dependence. We applied the method to a microarray data set measuring gene expression in a cell-cycle time series to show that it generates biologically relevant results. The R code is available at http://userwww.service.emory.edu/~tyu8/GDHC.
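To illustrate why a measure based on linear association can miss a nonlinear relation, the following minimal sketch compares Pearson correlation with distance correlation on a quadratic relation. Distance correlation is used here only as a well-known stand-in for a general dependence measure; it is not the measure developed in this work, and the function names (`pearson`, `distance_correlation`) and the toy data are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation: captures linear association only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def _centered_distances(v):
    """Pairwise absolute distances, double-centered by row, column, and grand mean."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]  # the matrix is symmetric, so row means equal column means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation (illustrative stand-in, not the paper's measure)."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    dvar = math.sqrt(mean_product(a, a) * mean_product(b, b))
    return math.sqrt(max(dcov2, 0.0) / dvar) if dvar > 0 else 0.0


# Toy features: x is symmetric around zero, so x and x**2 are linearly uncorrelated.
x = [(i - 20) / 20 for i in range(41)]
y_quadratic = [v ** 2 for v in x]    # nonlinear dependence on x
y_linear = [2 * v + 1 for v in x]    # linear dependence on x

r_quadratic = pearson(x, y_quadratic)
dcor_quadratic = distance_correlation(x, y_quadratic)
dcor_linear = distance_correlation(x, y_linear)

print("Pearson, quadratic relation:", r_quadratic)
print("Distance correlation, quadratic relation:", dcor_quadratic)
print("Distance correlation, linear relation:", dcor_linear)
```

In this toy case the Pearson correlation of the quadratic pair is essentially zero while the distance correlation is clearly positive, so a clustering based on correlation alone would treat the two features as unrelated. A dissimilarity such as one minus a dependence score could, in principle, be passed to a standard hierarchical clustering routine, but that step is not shown here.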