Department of Statistics, Iowa State University, Ames, Iowa, USA.
Gilead Sciences, Foster City, California, USA.
Stat Med. 2022 Oct 30;41(24):4924-4940. doi: 10.1002/sim.9544. Epub 2022 Aug 15.
Causal relationships are of crucial importance in biological and medical research, and algorithms have been proposed for learning causal structures and visualizing them as graphs. Much of the literature focuses on biological studies in which all variables follow the same type of distribution, for example, the normal distribution. Challenges emerge in epidemiological and clinical studies, where data are often a mix of continuous, binary, and ordinal variables. We propose using a mixed latent Gaussian copula model to estimate the underlying correlation structure of mixed data via rank correlation. This correlation structure is then incorporated into a popular causal discovery algorithm, the PC algorithm, to identify causal structures. The proposed algorithm, called the latent-PC algorithm, consistently discovers the true causal structure under mild conditions in high dimensional settings. In simulation studies, the latent-PC algorithm performs competitively, with a similar or higher true positive rate and a similar or lower false positive rate than other variants of the PC algorithm. In high dimensional settings where the number of variables exceeds the number of observations, the causal graphs identified by the latent-PC algorithm are closer to the true causal structures than those of competing algorithms. Further, we demonstrate the utility of the latent-PC algorithm on a real dataset for hepatocellular carcinoma: causal structures for patient survival are visualized and connected with clinical interpretations in the literature.
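The two ingredients named in the abstract, a rank-based estimate of the latent correlation and a PC-style conditional independence test, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes all variables are continuous, in which case the Gaussian copula gives the well-known bridge r = sin(pi/2 * tau) between Kendall's tau and the latent correlation; binary and ordinal variables need different bridge functions that are not shown here. The function names `latent_correlation_continuous`, `partial_correlation`, and `looks_independent`, the significance level, and the simulated chain are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import kendalltau, norm


def latent_correlation_continuous(X):
    """Latent correlation matrix from pairwise Kendall's tau.

    Assumption: every column is continuous, so r = sin(pi/2 * tau).
    The pairwise estimate is not guaranteed to be positive semidefinite.
    """
    n_vars = X.shape[1]
    R = np.eye(n_vars)
    for j in range(n_vars):
        for k in range(j + 1, n_vars):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(0.5 * np.pi * tau)
    return R


def partial_correlation(R, i, j, cond):
    """Partial correlation of variables i and j given the variables in cond."""
    idx = [i, j] + list(cond)
    # Pseudo-inverse of the relevant submatrix acts as a precision matrix.
    precision = np.linalg.pinv(R[np.ix_(idx, idx)])
    return -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])


def looks_independent(R, n_obs, i, j, cond, alpha=0.01):
    """Fisher z-test of zero partial correlation, as used inside PC-type
    algorithms. Returns True when independence is NOT rejected at level
    alpha, which is the case in which PC would drop the edge i - j."""
    r = np.clip(partial_correlation(R, i, j, cond), -0.999999, 0.999999)
    stat = np.sqrt(n_obs - len(cond) - 3) * abs(np.arctanh(r))
    p_value = 2.0 * norm.sf(stat)
    return p_value > alpha


# Simulated latent chain z0 -> z1 -> z2, observed only through monotone
# transformations. Kendall's tau is invariant to such transformations.
rng = np.random.default_rng(0)
n_obs = 2000
z0 = rng.standard_normal(n_obs)
z1 = 0.8 * z0 + 0.6 * rng.standard_normal(n_obs)
z2 = 0.8 * z1 + 0.6 * rng.standard_normal(n_obs)
X = np.column_stack([np.exp(z0), z1, z2 ** 3])

R = latent_correlation_continuous(X)
print(np.round(R, 2))
print("0 and 1 look independent:", looks_independent(R, n_obs, 0, 1, []))
print("partial corr of 0 and 2 given 1:", partial_correlation(R, 0, 2, [1]))
```

In this simulated chain the latent correlations are 0.8, 0.8, and 0.64, so the partial correlation between the first and third variables given the second is zero in the population. A PC-type search would apply such tests over conditioning sets of increasing size and remove an edge whenever independence is not rejected.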