Deangeli Giulio, Spillantini Maria Grazia, Liò Pietro
University of Cambridge, Department of Clinical Neurosciences, Clifford Allbutt Building, Hills Road, CB2 0HA Cambridge, UK.
University of Cambridge, Department of Computer Science and Technology, William Gates Building, 15 J. J. Thomson Ave, CB3 0FD Cambridge, UK.
bioRxiv. 2025 Jun 27:2025.06.22.660946. doi: 10.1101/2025.06.22.660946.
The correlation between transcriptomic (Tx) and proteomic (Px) profiles remains modest, both across genes and across samples, limiting the utility of transcriptomic data as a proxy for protein abundance. To address this, we introduce Proteomizer, a deep learning platform designed to infer a sample's Px landscape from its Tx and miRNomic (Mx) profiles. Trained on 8,613 matched Tx-Mx-Px samples from TCGA and CPTAC, Proteomizer achieved the highest Tx-Px correlation reported to date for this task. We further developed a Monte Carlo simulation framework to evaluate the impact of proteomization on differential expression analysis. Proteomizer substantially improved the accuracy of differential gene expression detection, with p-value precision increasing by up to 62-fold, and by as much as six orders of magnitude for a subset of genes enriched in mitochondrial and ribosomal functions. However, performance gains did not generalize to unseen tissue types or to datasets generated using different protocols. Finally, we applied explainable AI (XAI) techniques to identify regulatory relations contributing to Tx-Px discrepancies. Our predictions for 100 highly annotated genes were cross-compared against a literature-based biological knowledge graph of 322 million annotations; our explainers achieved a ROC-AUC of 0.74 in predicting miRNA-gene downregulation interactions. To our knowledge, this is the first study to systematically evaluate the biological relevance, limitations, and interpretability of proteomization models, establishing Proteomizer as a state-of-the-art tool for multiomic integration and hypothesis generation.
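The two ways of measuring Tx-Px agreement mentioned in the abstract can be illustrated with a minimal sketch. It assumes Pearson correlation and a samples-by-genes matrix layout; the abstract specifies neither, and the function names and toy values below are hypothetical and not taken from Proteomizer.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


def correlation_across_genes(tx, px):
    """One value per sample: correlate that sample's Tx and Px over all genes."""
    return [pearson(t_row, p_row) for t_row, p_row in zip(tx, px)]


def correlation_across_samples(tx, px):
    """One value per gene: correlate that gene's Tx and Px over all samples."""
    return [pearson(t_col, p_col) for t_col, p_col in zip(zip(*tx), zip(*px))]


# Made-up toy data: 4 samples (rows) x 3 genes (columns).
tx = [[1.0, 5.0, 9.0], [2.0, 6.0, 8.0], [3.0, 4.0, 7.0], [4.0, 7.0, 6.0]]
px = [[1.5, 4.0, 9.5], [2.5, 6.5, 7.0], [2.0, 5.0, 8.0], [4.5, 6.0, 5.0]]

per_sample = correlation_across_genes(tx, px)   # length = number of samples
per_gene = correlation_across_samples(tx, px)   # length = number of genes
print("across genes (per sample):", per_sample)
print("across samples (per gene):", per_gene)
```

The two summaries answer different questions: the per-sample values reflect how well transcript levels rank proteins within one sample, while the per-gene values reflect how well variation in a gene's transcript tracks variation in its protein between samples.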