Composition-on-composition regression analysis for multi-omics integration of metagenomic data.

Authors

Rios Nicholas, Shi Yuke, Chen Jun, Zhan Xiang, Xue Lingzhou, Li Qizhai

Affiliations

Department of Statistics, George Mason University, Fairfax, VA 22030, United States.

State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.

Publication information

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf387.

DOI:10.1093/bioinformatics/btaf387

PMID:40650352

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12279295/

Abstract

MOTIVATION

Compositional data arise in many disciplines, for example in the next-generation sequencing experiments widely used in biomedical studies. Regression analysis with compositional data as either responses or predictors has been well studied. However, when both responses and predictors are compositional, the inventory of analysis tools is surprisingly limited, especially in the high-dimensional setting. Most of the few existing methods rely on a log-ratio transformation to map compositional data from the simplex to real space. A serious weakness of these methods is that they cannot handle the substantial fraction of zeroes observed in data from next-generation sequencing experiments.
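To make the zero problem concrete, the following minimal sketch applies the centered log-ratio (CLR) transform, one common log-ratio transformation, to a composition with and without a zero part. The function name `clr` and the toy compositions are illustrative assumptions; they are not taken from the paper or from the COC code.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of one composition (1-D array summing to 1)."""
    log_x = np.log(x)
    return log_x - log_x.mean()

# A strictly positive composition maps to finite real coordinates.
positive = np.array([0.5, 0.3, 0.2])
clr_positive = clr(positive)

# A composition with a zero part, as is typical of sparse sequencing data.
with_zero = np.array([0.7, 0.3, 0.0])
# log(0) is -inf, so the transformed values are no longer finite;
# the NumPy warnings are silenced here only to keep the output clean.
with np.errstate(divide="ignore", invalid="ignore"):
    clr_with_zero = clr(with_zero)

print(np.isfinite(clr_positive).all())   # every coordinate is finite
print(np.isfinite(clr_with_zero).all())  # at least one coordinate is not
```

A common workaround is to add a small pseudocount before transforming, but the choice of pseudocount is arbitrary and can influence downstream results.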

RESULTS

To investigate associations between two high-dimensional multi-omics compositions, we propose a composition-on-composition (COC) regression analysis method that does not require log-ratio transformations and hence can handle zeroes in the data. To account for high dimensionality, we estimate the regression coefficients using a penalized estimation equation approach. We also propose inference procedures for COC regression. Comprehensive numerical simulations and case studies demonstrate the superior performance of COC.

AVAILABILITY AND IMPLEMENTATION

R source code implementing the COC method is available at https://github.com/nrios4/COC.
