Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
Department of Pediatrics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
Biometrics. 2023 Jun;79(2):1187-1200. doi: 10.1111/biom.13660. Epub 2022 Mar 30.
Many biomedical studies collect variables of mixed types from multiple groups of subjects, and some of these studies aim to separate the group-specific variation from the variation common to all groups. Previous work on similar problems relies mainly on the Pearson correlation, which cannot handle mixed data. To address this issue, we propose a latent mixed Gaussian copula (LMGC) model that quantifies the correlations among binary, ordinal, continuous, and truncated variables in a unified framework. We also provide a tool that decomposes the variation into group-specific and common components across multiple groups by solving a regularized M-estimation problem. Extensive simulation studies show the advantage of the proposed method over Pearson correlation-based methods, and show that solving the M-estimation problem jointly over multiple groups outperforms decomposing the variation group by group. Finally, we apply the method to a Chlamydia trachomatis genital tract infection study to demonstrate how it can be used to discover informative biomarkers that differentiate patients.
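To illustrate why a rank-based latent correlation can be preferable to the Pearson correlation under a Gaussian copula, the following minimal sketch simulates one pair of continuous variables. It is not the LMGC estimator from the paper. It uses only the classical identity for continuous pairs under a Gaussian copula, latent correlation = sin(pi/2 * Kendall's tau). The function names, the exp(2z) transform, the sample size, and the seed are all assumptions made for illustration. Binary, ordinal, and truncated variables require different bridge functions that are not shown here.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau by direct pair counting (O(n^2), fine for small n)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)


def demo(rho=0.7, n=1000, seed=0):
    """Return (Pearson on observed data, rank-based latent estimate)."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        # Latent bivariate normal pair with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Observed data: a skewed monotone transform of z1 (illustrative choice).
        x.append(math.exp(2.0 * z1))
        y.append(z2)
    # Kendall's tau is invariant to monotone transforms, so the identity
    # sin(pi/2 * tau) targets the latent correlation rho.
    latent_est = math.sin(math.pi / 2.0 * kendall_tau(x, y))
    return pearson(x, y), latent_est


if __name__ == "__main__":
    p, r = demo()
    print(f"Pearson on observed data: {p:.3f}")
    print(f"Rank-based latent estimate: {r:.3f}")
```

In this sketch the Pearson correlation is computed on the transformed scale and is therefore distorted by the skewed marginal, while the rank-based estimate depends only on the ordering of the observations.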