Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, Meibergdreef 9, 1100 DD Amsterdam, The Netherlands.
Bioinformatics. 2009 Nov 1;25(21):2764-71. doi: 10.1093/bioinformatics/btp491. Epub 2009 Aug 17.
Canonical correlation analysis (CCA) can be used to capture the underlying genetic background of a complex disease by associating two datasets that contain a patient's phenotypic and genetic details. Genetic information is often measured on a qualitative scale, so ordinary CCA cannot be applied to such data. Moreover, datasets in genetic studies can be enormous, which makes the results difficult to interpret.
We developed a penalized non-linear CCA approach that can deal with qualitative data by transforming each qualitative variable into a continuous variable through optimal scaling. Additionally, sparse results were obtained by adapting soft-thresholding to this non-linear version of CCA. By means of simulation studies, we show that our method is capable of extracting the relevant variables from high-dimensional datasets. We applied our method to a genetic dataset containing 144 patients with glial cancer.
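The two ingredients named above, optimal scaling of qualitative variables and soft-thresholding of the canonical weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the alternating update scheme, the penalty values `lam_u` and `lam_v`, and the simulated data are all assumptions chosen for illustration. Here each categorical variable is quantified by replacing every category with the mean of the other set's canonical variate within that category, and the weights are then shrunk by soft-thresholding.

```python
import numpy as np

def soft_threshold(w, lam):
    # Shrink each weight toward zero by lam; weights smaller than lam become 0.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def zscore(a):
    # Center and scale to unit variance (column-wise for matrices).
    a = a - a.mean(axis=0)
    sd = a.std(axis=0)
    return a / np.where(sd > 0, sd, 1.0)

def optimal_scale(codes, target):
    # Quantify one categorical variable: each category gets the mean of
    # `target` over the samples in that category, then the result is z-scored.
    quant = np.zeros(len(codes), dtype=float)
    for c in np.unique(codes):
        mask = codes == c
        quant[mask] = target[mask].mean()
    return zscore(quant)

def sparse_nonlinear_cca(X_cat, Y, lam_u=0.2, lam_v=0.2, n_iter=30):
    # Hypothetical alternating scheme: rescale X, update u, then update v.
    n, p = X_cat.shape
    Y = zscore(Y.astype(float))
    u = np.zeros(p)
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    X = np.zeros((n, p))
    for _ in range(n_iter):
        eta = zscore(Y @ v)  # canonical variate of the continuous set
        X = np.column_stack([optimal_scale(X_cat[:, j], eta) for j in range(p)])
        u = soft_threshold(X.T @ eta / n, lam_u)
        if not u.any():
            break  # penalty too large: every variable was removed
        u /= np.linalg.norm(u)
        xi = zscore(X @ u)  # canonical variate of the quantified set
        v = soft_threshold(Y.T @ xi / n, lam_v)
        if not v.any():
            break
        v /= np.linalg.norm(v)
    return u, v, X

# Small simulation (assumed setup): only the first two categorical variables
# and the first two continuous variables are associated.
rng = np.random.default_rng(0)
n, p, q = 500, 20, 8
X_cat = rng.integers(0, 3, size=(n, p))  # genotype-like codes 0, 1, 2
Y = rng.normal(size=(n, q))
signal = X_cat[:, 0] + X_cat[:, 1]
Y[:, 0] = signal + 0.5 * rng.normal(size=n)
Y[:, 1] = signal + 0.5 * rng.normal(size=n)

u, v, X_scaled = sparse_nonlinear_cca(X_cat, Y)
canonical_corr = np.corrcoef(X_scaled @ u, zscore(Y) @ v)[0, 1]
print("selected X variables:", np.flatnonzero(u))
print("selected Y variables:", np.flatnonzero(v))
print("canonical correlation:", round(canonical_corr, 3))
```

In this sketch the penalties act directly on sample correlations, so a variable survives only if its correlation with the opposite canonical variate exceeds the threshold. In practice the penalty values would need to be tuned, for example by cross-validation.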