Rios Nicholas, Shi Yuke, Chen Jun, Zhan Xiang, Xue Lingzhou, Li Qizhai
Department of Statistics, George Mason University, Fairfax, VA 22030, United States.
State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf387.
Compositional data arise in many disciplines, for example in the next-generation sequencing experiments widely used in biomedical studies. Regression analysis with compositional data as either responses or predictors has been well studied. However, when both responses and predictors are compositional, the inventory of analysis tools is surprisingly limited, especially in the high-dimensional setting. Most of the few existing methods rely on a log-ratio transformation to move compositional data from the simplex to real space. A serious weakness of these methods is that they cannot handle the substantial fraction of zeroes observed in data from next-generation sequencing experiments, because the logarithm of zero is undefined.
To investigate associations between two high-dimensional multi-omics compositions, we propose a composition-on-composition (COC) regression analysis method which does not require log-ratio transformations and hence can handle zeroes in the data. To account for high dimensionality, we estimate regression coefficients using a penalized estimation equation approach. Finally, inference procedures for COC regression are also proposed. Superior performance of COC is demonstrated through both comprehensive numerical simulations and case studies.
R source code implementing the COC method is available at https://github.com/nrios4/COC.
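The weakness of log-ratio approaches noted above can be illustrated with a minimal sketch. It is not part of the COC software. It applies the centered log-ratio (CLR) transform, one common member of the log-ratio family, to two made-up compositions. The function name `clr` and the example abundances are assumptions chosen for illustration only.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a single composition (illustrative helper)."""
    logx = np.log(x)
    return logx - logx.mean()

# Hypothetical relative abundances; each vector sums to 1.
positive = np.array([0.5, 0.3, 0.2])   # strictly positive composition
with_zero = np.array([0.7, 0.3, 0.0])  # one taxon is unobserved (zero)

clr_positive = clr(positive)

# log(0) is -inf, so the mean of the logs is -inf and every CLR entry
# becomes inf or nan; the warnings are silenced here to keep the output clean.
with np.errstate(divide="ignore", invalid="ignore"):
    clr_with_zero = clr(with_zero)

print("CLR, no zeros:  ", clr_positive)
print("CLR, with zero: ", clr_with_zero)
print("all finite without zeros:", np.isfinite(clr_positive).all())
print("any finite with a zero:  ", np.isfinite(clr_with_zero).any())
```

In practice, log-ratio pipelines usually work around this by adding a pseudocount or imputing zeros before transforming, which alters the data. A method that avoids the transformation does not need that step.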