Teran Hidalgo Sebastian J, Wu Mengyun, Ma Shuangge
Department of Biostatistics, Yale University, 60 College Street, New Haven, 06520, USA.
School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai, 200433, China.
BMC Genomics. 2017 Aug 16;18(1):623. doi: 10.1186/s12864-017-3990-1.
In biomedical research, gene expression profiling studies have been conducted extensively. The analysis of gene expression data has led to a deeper understanding of human genetics as well as to practically useful models. Clustering analysis is a critical component of gene expression data analysis and can reveal previously unknown interconnections among genes. Because gene expression data are high-dimensional, many existing clustering methods produce results that are less than satisfactory. Intuitively, this is caused by "a lack of information". A prominent trend in recent profiling studies is to collect data on gene expressions together with their regulators (copy number alteration, microRNA, methylation, etc.) on the same subjects, which makes it possible to borrow information from other types of omics measurements when analyzing gene expression.
In this study, an ANCut approach is developed, built on regularized estimation and the normalized cut (NCut) technique. R code implementing this approach has also been developed.
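The general idea of combining regularized estimation with a normalized cut can be illustrated with a minimal sketch. This is not the authors' ANCut implementation or their R code. It assumes a ridge penalty as a stand-in for the regularized estimator, absolute correlation of the regulator-explained expression as the gene similarity, and the standard spectral relaxation of NCut for a two-way split. All function names (`ridge_fit`, `ncut_bipartition`, `assisted_bipartition`) and the simulated data are hypothetical.

```python
import numpy as np


def ridge_fit(Z, X, lam=1.0):
    """Ridge-regress every gene (column of X) on the regulators Z.

    Returns the fitted values, i.e. the part of expression explained by
    the regulators. Ridge is an assumption made for this sketch.
    """
    q = Z.shape[1]
    B = np.linalg.solve(Z.T @ Z + lam * np.eye(q), Z.T @ X)
    return Z @ B


def ncut_bipartition(W):
    """Two-way normalized cut via its spectral relaxation.

    W is a symmetric, non-negative gene-by-gene similarity matrix.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(d)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Eigenvector of the second-smallest eigenvalue, mapped back to the
    # generalized eigenproblem (D - W) y = lambda * D * y
    y = d_inv_sqrt * eigvecs[:, 1]
    return (y > 0).astype(int)


def assisted_bipartition(Z, X, lam=1.0):
    """Cluster genes using similarity computed on regulator-fitted expression."""
    X_hat = ridge_fit(Z, X, lam)
    W = np.abs(np.corrcoef(X_hat, rowvar=False))
    return ncut_bipartition(W)


# Simulated toy data (hypothetical): 200 subjects, 2 regulators, 20 genes.
# Genes 0-9 are driven by regulator 0, genes 10-19 by regulator 1.
rng = np.random.default_rng(0)
n, p = 200, 20
Z = rng.normal(size=(n, 2))
X = np.empty((n, p))
X[:, :10] = Z[:, [0]] + 0.5 * rng.normal(size=(n, 10))
X[:, 10:] = Z[:, [1]] + 0.5 * rng.normal(size=(n, 10))

labels = assisted_bipartition(Z, X)
print(labels)
```

In this sketch the regulators enter only through the fitted values used to build the similarity matrix. How ANCut actually weights regulator-derived information against the raw expression data is not specified here and is not reproduced.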
Simulations show that the proposed approach outperforms direct competitors. Analysis of TCGA (The Cancer Genome Atlas) data further demonstrates its satisfactory performance.
We propose a more effective clustering analysis of gene expression data that draws on information from regulators. It provides a new avenue for analyzing gene expression data based on the assisted analysis strategy.