Department of Genetics, Microbiology and Statistics, University of Barcelona, Barcelona, Spain.
PLoS One. 2024 Jul 23;19(7):e0307482. doi: 10.1371/journal.pone.0307482. eCollection 2024.
High-throughput technologies have generated vast amounts of omics data. There is broad consensus that integrating diverse omics sources improves predictive models and biomarker discovery. However, managing multiple omics datasets poses challenges such as data heterogeneity, noise, high dimensionality, and missing data, especially in block-wise patterns. This study addresses the challenges of high dimensionality and block-wise missing data through a regularization and constraint-based approach. The methodology is implemented in the R package bwm for binary and continuous response variables and is applied to breast cancer and exposome multi-omics datasets, achieving strong performance even in scenarios where missing data are present in all omics. In the binary classification task, the proposed model achieves accuracy in the range of 86% to 92% and F1 in the range of 68% to 79%. In the regression task, the correlation between true and predicted responses is in the range of 72% to 76%. Performance metrics decline slightly as the percentage of missing data increases. In scenarios where block-wise missing data affect multiple omics, model performance actually surpasses that of scenarios where missing data are present in only one omics. One possible explanation is that the multi-omics missing-data scenarios introduce a greater diversity of observation profiles, leading to a more robust model. Consistency in feature selection across block-wise missing-data scenarios varies with the omics being studied, with some omics showing greater consistency than others.
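To make the notions of block-wise missingness and observation profiles concrete, the following is a minimal sketch in Python. It is not the bwm package's API, and it does not implement the regularized model; the function names, the three-omics layout, and the toy availability flags are all hypothetical and chosen only for illustration. Each sample carries one flag per omics block, and samples sharing the same pattern of available blocks form one observation profile.

```python
from collections import defaultdict


def observation_profiles(availability):
    """Group sample indices by their pattern of observed omics blocks.

    availability: list of tuples of booleans, one tuple per sample and
    one flag per omics block (True means the whole block is observed).
    Returns a dict mapping each pattern to the list of sample indices.
    """
    groups = defaultdict(list)
    for idx, flags in enumerate(availability):
        groups[tuple(bool(f) for f in flags)].append(idx)
    return dict(groups)


def fraction_with_missing_block(availability):
    """Fraction of samples that lack at least one entire omics block."""
    incomplete = sum(1 for flags in availability if not all(flags))
    return incomplete / len(availability)


# Hypothetical toy layout: 6 samples, 3 omics blocks
# (for example expression, methylation, proteomics; names are assumptions).
availability = [
    (True, True, True),
    (True, True, False),
    (True, False, True),
    (True, True, True),
    (True, True, False),
    (False, True, True),
]

profiles = observation_profiles(availability)
for pattern, samples in profiles.items():
    print(pattern, samples)
print(fraction_with_missing_block(availability))
```

In this toy layout, missing data touch all three blocks and produce four distinct profiles; a scenario in which only one block is ever missing would produce at most two. This is the sense in which missingness spread over several omics can yield a greater diversity of observation profiles.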