Life Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.
Bioinformatics. 2010 Jun 15;26(12):i97-105. doi: 10.1093/bioinformatics/btq181.
Molecular association of phenotypic responses is an important step in generating hypotheses and in initiating the design of new experiments. Current practices for associating gene expression data with multidimensional phenotypic data are typically (i) performed one-to-one, i.e. each gene is examined independently against a phenotypic index, and (ii) tested with one stress condition at a time, i.e. different perturbations are analyzed separately. As a result, the complex coordination among the genes responsible for a phenotypic profile is potentially lost. More importantly, univariate analysis can hide new insights into common mechanisms of response.
In this article, we propose a sparse, multitask regression model, together with co-clustering analysis, to explore the intrinsic grouping in the association between gene expression and phenotypic signatures. The global structure of the association is captured by learning an intrinsic template that is shared among experimental conditions, with local perturbations introduced to integrate the effects of therapeutic agents. We demonstrate the performance of our approach on both synthetic and experimental data. On synthetic data, the multitask regression achieves a greater reduction in regression error than traditional L1- and L2-regularized regression. On experimental data, tests with cell cycle inhibitors over a panel of 14 breast cancer cell lines demonstrate the relevance of the computed molecular predictors to the cell cycle machinery, and identify hidden variables that are not captured by the baseline regression analysis. In particular, the system has identified CLCA2 as a hidden transcript and as a common mechanism of response for two therapeutic agents, CI-1040 and Iressa, which are currently in clinical use.
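The idea of a shared template with condition-specific perturbations can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract does not give the objective, so the sketch assumes that each condition's coefficient vector decomposes as `w0 + V[t]`, that both parts carry an L1 penalty, and that the problem is solved by proximal gradient descent. The function names, the penalty weights `lam_shared` and `lam_local`, and the synthetic data are all hypothetical.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def objective(Xs, ys, w0, V, lam_shared, lam_local):
    """Squared-error loss over all conditions plus the two L1 penalties."""
    loss = sum(
        np.sum((X @ (w0 + v) - y) ** 2) / (2 * len(y))
        for X, y, v in zip(Xs, ys, V)
    )
    return loss + lam_shared * np.abs(w0).sum() + lam_local * np.abs(V).sum()


def fit_shared_template(Xs, ys, lam_shared=0.05, lam_local=0.1, n_iter=3000):
    """Assumed model: condition t uses coefficients w0 + V[t].

    w0 is the template shared by all conditions; V[t] is a sparse,
    condition-specific perturbation. Solved by proximal gradient steps.
    """
    T, p = len(Xs), Xs[0].shape[1]
    w0, V = np.zeros(p), np.zeros((T, p))
    # Conservative step size: 1 / ((T + 1) * largest per-condition curvature).
    curv = max(np.linalg.eigvalsh(X.T @ X / len(X)).max() for X in Xs)
    step = 1.0 / ((T + 1) * curv)
    for _ in range(n_iter):
        # Per-condition gradient of the squared-error loss.
        grads = np.stack(
            [X.T @ (X @ (w0 + v) - y) / len(y) for X, y, v in zip(Xs, ys, V)]
        )
        # The shared template sees the summed gradient; each perturbation
        # sees only its own condition. Both are then shrunk toward zero.
        w0 = soft_threshold(w0 - step * grads.sum(axis=0), step * lam_shared)
        V = soft_threshold(V - step * grads, step * lam_local)
    return w0, V


# Hypothetical synthetic data: three conditions share features 0-2,
# and condition 1 additionally responds to feature 5.
rng = np.random.default_rng(0)
n, p, T = 80, 20, 3
shared = np.zeros(p)
shared[:3] = 2.0
local = np.zeros((T, p))
local[1, 5] = 1.5
Xs = [rng.standard_normal((n, p)) for _ in range(T)]
ys = [X @ (shared + local[t]) + 0.1 * rng.standard_normal(n)
      for t, X in enumerate(Xs)]

w0, V = fit_shared_template(Xs, ys)
print("shared template:", np.round(w0, 2))
print("perturbation for condition 1:", np.round(V[1], 2))
```

In this sketch, choosing `lam_shared` smaller than `T * lam_local` makes it cheaper to place an effect common to all conditions in `w0` than to repeat it in every `V[t]`, which is one simple way to encourage a shared template.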