Samorodnitsky Sarah, Wendt Chris H, Lock Eric F
Division of Biostatistics, University of Minnesota, Minneapolis, 55455, MN, USA.
Fred Hutch Cancer Center, Seattle, 98109, WA, USA.
Comput Stat Data Anal. 2024 Sep;197. doi: 10.1016/j.csda.2024.107974. Epub 2024 Apr 30.
Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors, and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including "blockwise" missingness. Simulations show that BSFP is competitive in recovering latent variation structure and demonstrate the importance of accounting for uncertainty in the estimated factorization within the predictive model. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.
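The link the abstract draws between normal priors on the factors and a nuclear norm penalty can be illustrated for a single data matrix. The sketch below is not the authors' implementation and does not cover the joint and individual structures of BSF. It assumes unit noise variance and uses the standard identity that minimizing 0.5*||X - U V^T||_F^2 + (lam/2)*(||U||_F^2 + ||V||_F^2) over U and V gives the same fit as minimizing 0.5*||X - Z||_F^2 + lam*||Z||_* over Z, which is solved by soft-thresholding the singular values of X. The function name `nuclear_norm_mode` and the value of `lam` are hypothetical choices for illustration.

```python
import numpy as np


def nuclear_norm_mode(X, lam):
    """Single-matrix sketch (assumption, not the BSF algorithm).

    Solves min_Z 0.5*||X - Z||_F^2 + lam*||Z||_* by soft-thresholding
    the singular values of X, and returns balanced factors whose ridge
    penalty equals twice the nuclear norm of the solution.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)        # shrink singular values toward zero
    rank = int(np.count_nonzero(s_shrunk))     # components that survive the threshold
    root = np.sqrt(s_shrunk[:rank])
    scores = U[:, :rank] * root                # n x rank factor
    loadings = Vt[:rank, :].T * root           # p x rank factor
    return scores, loadings, rank


# Toy input with known singular values 5, 3, 1 (illustrative data only).
X = np.diag([5.0, 3.0, 1.0])
scores, loadings, rank = nuclear_norm_mode(X, lam=2.0)
Z = scores @ loadings.T

print(rank)            # 2: the singular value 1 falls below lam and is removed
print(np.round(Z, 6))  # diagonal entries 3, 1, 0
```

In this toy case the penalty removes the weakest component entirely, which is the sense in which a nuclear norm penalty also selects the rank. The squared Frobenius norms of `scores` and `loadings` sum to 8, twice the nuclear norm of `Z` (3 + 1 = 4), matching the identity stated above.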