Magazzù Giuseppe, Zampieri Guido, Angione Claudio
School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK.
Department of Biology, University of Padova, Padova, Italy.
Bioinformatics. 2021 Oct 25;37(20):3546-3552. doi: 10.1093/bioinformatics/btab324.
Technological advances have made high-throughput biological data cheaper to collect, leading to the availability of vast amounts of omic data of different types. In parallel, the in silico reconstruction and modeling of metabolic systems is now acknowledged as a key tool to complement experimental data on a large scale. Integrating this model-driven and data-driven information is therefore emerging as a new challenge in systems biology, with no clear guidance on how best to exploit the inherent multisource and multiomic nature of these data types while preserving mechanistic interpretation.
Here, we investigate different regularization techniques for high-dimensional data derived from the integration of gene expression profiles with metabolic flux data, extracted from strain-specific metabolic models, to improve cellular growth rate predictions. To this end, we propose ad hoc extensions of previous regularization frameworks, including group, view-specific and principal component regularization, and experimentally compare them using data from 1143 Saccharomyces cerevisiae strains. We observe that the methods diverge in regression accuracy and integration effectiveness depending on the type of regularization employed. In multiomic regression tasks, when learning from experimental and model-generated omic data, our results demonstrate the competitiveness and ease of interpretation of multimodal regularized linear models compared to data-hungry methods based on neural networks.
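As an illustration of what view-specific regularization can look like, the sketch below fits a lasso in which each data view (for example gene expression and metabolic fluxes) has its own penalty factor. It is a minimal example under stated assumptions, not the implementation from this work: the data are synthetic, the view sizes, penalty factors and the helper names `view_penalty_factors` and `weighted_lasso` are hypothetical, and a plain proximal-gradient solver stands in for whatever optimizer the authors use.

```python
import numpy as np


def view_penalty_factors(view_sizes, view_weights):
    """Expand one penalty factor per view into one factor per feature."""
    return np.repeat(np.asarray(view_weights, dtype=float), view_sizes)


def weighted_lasso(X, y, lam, penalty_factors, n_iter=500):
    """Minimise (1/2n)||y - X b||^2 + lam * sum_j w_j |b_j| by proximal gradient."""
    n, p = X.shape
    w = np.asarray(penalty_factors, dtype=float)
    # Lipschitz constant of the gradient of the squared-error term.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / lipschitz
        # Soft-thresholding with a feature-specific threshold.
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * w / lipschitz, 0.0)
    return beta


# Synthetic stand-in for two views (assumed sizes): 30 "expression" features
# followed by 20 "flux" features, measured on 100 samples.
rng = np.random.default_rng(0)
n_samples, n_expr, n_flux = 100, 30, 20
X = rng.standard_normal((n_samples, n_expr + n_flux))
true_beta = np.zeros(n_expr + n_flux)
true_beta[[0, 1]] = [2.0, -1.5]   # signal carried by the first view
true_beta[n_expr] = 1.0           # signal carried by the second view
y = X @ true_beta + 0.1 * rng.standard_normal(n_samples)
y = y - y.mean()                  # centred response, so no intercept is fitted

# Penalise the second view twice as strongly as the first (arbitrary choice).
w = view_penalty_factors([n_expr, n_flux], [1.0, 2.0])
beta = weighted_lasso(X, y, lam=0.05, penalty_factors=w)

print("non-zero coefficients, view 1:", np.count_nonzero(beta[:n_expr]))
print("non-zero coefficients, view 2:", np.count_nonzero(beta[n_expr:]))
```

In practice the per-view factors would be chosen by cross-validation over a grid of candidate values rather than fixed by hand, and the selected coefficients can then be read per view to judge how much each data source contributes to the prediction.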
All data, models and code produced in this work are available on GitHub at https://github.com/Angione-Lab/HybridGroupIPFLasso_pc2Lasso.
Supplementary data are available at Bioinformatics online.