Musib Laila, Coletti Roberta, Lopes Marta B, Mouriño Helena, Carrasquinha Eunice
Departamento de Estatística e Investigação Operacional, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisboa, 1749-016, Portugal.
CEAUL - Centro de Estatística e Aplicações, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal.
BioData Min. 2024 Oct 29;17(1):45. doi: 10.1186/s13040-024-00401-0.
High-dimensional omics data integration has emerged as a prominent avenue within the healthcare industry, presenting substantial potential to improve predictive models. However, the data integration process faces several challenges, including data heterogeneity, priority sequence in which data blocks are prioritized for rendering predictive information contained in multiple blocks, assessing the flow of information from one omics level to the other and multicollinearity.
We propose the Priority-Elastic net algorithm, a hierarchical regression method extending Priority-Lasso for the binary logistic regression model by incorporating a priority order for blocks of variables while fitting Elastic-net models sequentially for each block. The fitted values from each step are then used as an offset in the subsequent step. Additionally, we considered the adaptive elastic-net penalty within our priority framework to compare the results.
The Priority-Elastic net and Priority-Adaptive Elastic net algorithms were evaluated on a brain tumor dataset available from The Cancer Genome Atlas (TCGA), accounting for transcriptomics, proteomics, and clinical information measured over two glioma types: Lower-grade glioma (LGG) and glioblastoma (GBM).
Our findings suggest that the Priority-Elastic net is a highly advantageous choice for a wide range of applications. It offers moderate computational complexity, flexibility in integrating prior knowledge while introducing a hierarchical modeling perspective, and, importantly, improved stability and accuracy in predictions, making it superior to the other methods discussed. This evolution marks a significant step forward in predictive modeling, offering a sophisticated tool for navigating the complexities of multi-omics datasets in pursuit of precision medicine's ultimate goal: personalized treatment optimization based on a comprehensive array of patient-specific data. This framework can be generalized to time-to-event, Cox proportional hazards regression and multicategorical outcomes. A practical implementation of this method is available upon request in R script, complete with an example to facilitate its application.
高维组学数据整合已成为医疗行业中一个突出的途径,具有显著提升预测模型的潜力。然而,数据整合过程面临若干挑战,包括数据异质性、确定数据块优先级顺序以呈现多个数据块中包含的预测信息、评估从一个组学水平到另一个组学水平的信息流以及多重共线性。
我们提出了优先级弹性网算法,这是一种层次回归方法,通过在拟合弹性网模型时为变量块引入优先级顺序,对用于二元逻辑回归模型的优先级套索法进行了扩展。然后将每个步骤的拟合值用作后续步骤的偏移量。此外,我们在优先级框架内考虑了自适应弹性网惩罚以比较结果。
在可从癌症基因组图谱(TCGA)获取的脑肿瘤数据集上评估了优先级弹性网和优先级自适应弹性网算法,该数据集涵盖了两种胶质瘤类型(低级别胶质瘤(LGG)和成胶质细胞瘤(GBM))的转录组学、蛋白质组学和临床信息。
我们的研究结果表明,优先级弹性网在广泛的应用中是一个非常有利的选择。它具有适度的计算复杂度,在整合先验知识时具有灵活性,同时引入了层次建模视角,重要的是,在预测方面具有更高的稳定性和准确性,使其优于所讨论的其他方法。这一进展标志着预测建模向前迈出了重要一步,为应对多组学数据集的复杂性提供了一个复杂的工具,以追求精准医学的最终目标:基于一系列全面的患者特定数据进行个性化治疗优化。该框架可推广到生存时间、Cox比例风险回归和多分类结果。可根据要求提供该方法在R脚本中的实际实现,并配有示例以方便其应用。