Chi Calvin, Ye Yuting, Chen Bin, Huang Haiyan
Center of Computational Biology, College of Engineering, University of California, Berkeley, CA 94720, USA.
Division of Biostatistics, University of California, Berkeley, CA 94720, USA.
Bioinformatics. 2021 Sep 9;37(17):2617-2626. doi: 10.1093/bioinformatics/btab143.
In pharmacogenomic studies, the biological context of cell lines influences the predictive ability of drug-response models and the discovery of biomarkers. Thus, similar cell lines are often studied together based on prior knowledge of biological annotations. However, this selection approach does not scale with the number of annotations, and the relationship between gene-drug association patterns and biological context may not be obvious.
We present a procedure to compare cell lines based on their gene-drug association patterns. Starting with a grouping of cell lines from biological annotation, we model the gene-drug association patterns of each group as a bipartite graph between genes and drugs. To do so, we apply sparse canonical correlation analysis (SCCA) to extract the gene-drug associations and use the canonical vectors to construct the edge weights. We then introduce a nuclear norm-based dissimilarity measure to compare the bipartite graphs. Accompanying the procedure is a permutation test that evaluates the significance of the similarity between cell line groups in terms of gene-drug associations. In the pharmacogenomic datasets CTRP2, GDSC2 and CCLE, hierarchical clustering of carcinoma groups based on this dissimilarity measure uniquely reveals clustering patterns driven by carcinoma subtype rather than primary site. Next, we show that the top associated drugs or genes from SCCA can be used to characterize the clustering patterns of haematopoietic and lymphoid malignancies. Finally, we confirm by simulation that when drug responses depend linearly on gene expression, our approach is the only one among those compared that effectively infers the true hierarchy.
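The pipeline described above (sparse canonical vectors, bipartite edge weights, nuclear norm dissimilarity, permutation test) can be illustrated with a minimal Python sketch. It is not the hierBipartite implementation, and every name in it is hypothetical. Three choices are assumptions made only for illustration: `sparse_cca_sketch` is a simplified soft-thresholded power iteration standing in for SCCA, the edge weights are taken as the outer product of one pair of canonical vectors, and the dissimilarity is normalised as ||B1 - B2||_* / (||B1||_* + ||B2||_*). The null distribution is built by shuffling drug-response rows within each group to break gene-drug associations, which is one plausible reading of a test for significance of similarity.

```python
import numpy as np

def sparsify(a, frac):
    """Soft-threshold a at frac * max|a| (relative threshold, an assumption)."""
    thr = frac * np.abs(a).max()
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def unit(a):
    """Scale a to unit Euclidean norm; leave an all-zero vector unchanged."""
    n = np.linalg.norm(a)
    return a / n if n > 0 else a

def sparse_cca_sketch(x, y, frac=0.3, n_iter=50):
    """Simplified stand-in for SCCA: alternating soft-thresholded power
    iterations on the gene-by-drug cross-covariance matrix."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    c = x.T @ y / (x.shape[0] - 1)
    # Start from the leading right singular vector of the cross-covariance.
    v = np.linalg.svd(c, full_matrices=False)[2][0]
    u = np.zeros(c.shape[0])
    for _ in range(n_iter):
        u = unit(sparsify(c @ v, frac))
        v = unit(sparsify(c.T @ u, frac))
    return u, v

def association_matrix(u, v):
    """Bipartite edge weights (genes x drugs); the outer product is assumed."""
    return np.outer(u, v)

def dissimilarity(b1, b2):
    """Nuclear norm of the difference, scaled to lie in [0, 1]."""
    denom = np.linalg.norm(b1, ord="nuc") + np.linalg.norm(b2, ord="nuc")
    return np.linalg.norm(b1 - b2, ord="nuc") / denom if denom > 0 else 0.0

def group_dissimilarity(x1, y1, x2, y2):
    """Dissimilarity between the association graphs of two cell line groups."""
    b1 = association_matrix(*sparse_cca_sketch(x1, y1))
    b2 = association_matrix(*sparse_cca_sketch(x2, y2))
    return dissimilarity(b1, b2)

def similarity_p_value(x1, y1, x2, y2, n_perm=200, seed=0):
    """Permutation p-value: share of null dissimilarities (drug rows shuffled
    within each group) that are at most the observed dissimilarity."""
    rng = np.random.default_rng(seed)
    observed = group_dissimilarity(x1, y1, x2, y2)
    hits = 0
    for _ in range(n_perm):
        y1p = y1[rng.permutation(len(y1))]
        y2p = y2[rng.permutation(len(y2))]
        hits += group_dissimilarity(x1, y1p, x2, y2p) <= observed
    return observed, (1 + hits) / (1 + n_perm)

def simulate_group(rng, weights, n=200, noise=0.1):
    """Toy group: drug response depends linearly on expression plus noise."""
    x = rng.standard_normal((n, weights.shape[0]))
    y = x @ weights + noise * rng.standard_normal((n, weights.shape[1]))
    return x, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.zeros((20, 5))
    w[0:3, 0:2] = 1.0  # genes 0-2 drive drugs 0-1 in both toy groups
    xa, ya = simulate_group(rng, w)
    xb, yb = simulate_group(rng, w)
    d, p = similarity_p_value(xa, ya, xb, yb, n_perm=50)
    print(f"dissimilarity={d:.3f}, permutation p-value={p:.3f}")
```

Because the outer product is unchanged when both canonical vectors flip sign, the comparison does not depend on the arbitrary sign of the SCCA solution. A matrix of pairwise `group_dissimilarity` values could then be passed to any standard hierarchical clustering routine.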
Bipartite graph-based hierarchical clustering is implemented in R and is available from CRAN: https://CRAN.R-project.org/package=hierBipartite. The source code is available at https://github.com/CalvinTChi/hierBipartite. The datasets were derived from public-domain sources: the Cancer Cell Line Encyclopedia (https://portals.broadinstitute.org/ccle), the Cancer Therapeutics Response Portal (https://portals.broadinstitute.org/ctrp.v2.1/?page=#ctd2BodyHome), and the Genomics of Drug Sensitivity in Cancer (https://www.cancerrxgene.org/). These datasets can be downloaded using the PharmacoGx R package (https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html).
Supplementary data are available at Bioinformatics online.