Alakuş Cansu, Larocque Denis, Jacquemont Sébastien, Barlaam Fanny, Martin Charles-Olivier, Agbogba Kristian, Lippé Sarah, Labbe Aurélie
Department of Decision Sciences, HEC Montréal, Montréal, QC H3T 2A7, Canada.
Department of Pediatrics, Université de Montréal, Montréal, QC H3T 1C5, Canada.
Bioinformatics. 2021 Sep 9;37(17):2714-2721. doi: 10.1093/bioinformatics/btab158.
Investigating the relationships between two sets of variables helps to understand their interactions and can be done with canonical correlation analysis (CCA). However, the correlation between the two sets can sometimes depend on a third set of covariates, often subject-related ones such as age, gender or other clinical measures. In this case, applying CCA to the whole population is not optimal and methods to estimate conditional CCA, given the covariates, can be useful.
We propose a new method called Random Forest with Canonical Correlation Analysis (RFCCA) to estimate the conditional canonical correlations between two sets of variables given subject-related covariates. The individual trees in the forest are built with a splitting rule specifically designed to partition the data so as to maximize the canonical correlation heterogeneity between child nodes. We also propose a significance test to detect the global effect of the covariates on the relationship between the two sets of variables. Simulation studies show that the proposed method provides accurate canonical correlation estimates and that the global significance test has a well-controlled Type-1 error. We also show an application of the proposed method to EEG data.
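The idea behind the splitting rule can be illustrated with a minimal Python sketch. It is not the RFCCA package code: the function names, the `min_node_size` parameter, and in particular the score used (the absolute difference between the first canonical correlations of the two child nodes, weighted by the square root of the product of their sizes) are assumptions made for illustration, since the abstract only states that splits maximize canonical correlation heterogeneity between child nodes. The sketch also assumes a single numeric covariate `z` and full-column-rank data in each node.

```python
import numpy as np


def first_canonical_correlation(X, Y):
    """Largest canonical correlation between the column sets X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each variable so that inner products behave like covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two centered column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(s[0], 0.0, 1.0))


def split_score(X, Y, z, threshold, min_node_size=10):
    """Heterogeneity score for splitting on z <= threshold.

    Assumed form (illustrative): sqrt(n_left * n_right) * |rho_left - rho_right|.
    """
    left = z <= threshold
    right = ~left
    n_left, n_right = int(left.sum()), int(right.sum())
    # Skip splits whose child nodes are too small for a stable estimate.
    if n_left < min_node_size or n_right < min_node_size:
        return -np.inf
    rho_left = first_canonical_correlation(X[left], Y[left])
    rho_right = first_canonical_correlation(X[right], Y[right])
    return float(np.sqrt(n_left * n_right) * abs(rho_left - rho_right))


def best_split(X, Y, z, min_node_size=10):
    """Return (threshold, score) of the best split on one covariate z."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    z = np.asarray(z, dtype=float)
    # Every observed value except the largest is a candidate threshold.
    candidates = np.unique(z)[:-1]
    scores = [split_score(X, Y, z, t, min_node_size) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])
```

A tree would apply such a search recursively over all covariates, and a forest would average the resulting estimates over many trees grown on resampled data; both steps are omitted here for brevity.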
RFCCA is implemented in a freely available R package on CRAN (https://CRAN.R-project.org/package=RFCCA).
Supplementary data are available at Bioinformatics online.