Baek Seungchul, Ho Yen-Yi, Ma Yanyuan
Department of Mathematics and Statistics, University of Maryland Baltimore County, Baltimore, Maryland.
Department of Statistics, University of South Carolina, Columbia, South Carolina.
Biometrics. 2020 Dec;76(4):1340-1350. doi: 10.1111/biom.13208. Epub 2020 Jan 6.
High-dimensional gene expression data often exhibit intricate correlation patterns as a result of coordinated genetic regulation. In practice, however, it is difficult to measure these underlying coordinated activities directly. Analysis of breast cancer survival data with gene expression measurements motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes. Compared with existing approaches, our proposed procedure has several unique characteristics. In the first stage, it incorporates prior biological knowledge about gene-pathway membership into the analysis and explicitly models the effects of genetic pathways on the latent factors. It also provides estimates specific to each cancer subtype, in order to characterize the molecular heterogeneity of breast cancer. Finally, the framework incorporates a sparsity condition, because genetic networks are often sparse. In the second stage, we investigate the relationship between latent factor activity levels and censored survival time using a general dimension reduction model in the survival analysis context. Combining the factor model with the sufficient direction model provides an efficient way of analyzing high-dimensional data and reveals some interesting relations in the breast cancer gene expression data.