Wang Jiangxin, Wu Gang, Chen Lei, Zhang Weiwen
Laboratory of Synthetic Microbiology, School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072, People's Republic of China.
Key Laboratory of Systems Bioengineering, Ministry of Education of China, Tianjin, 300072, People's Republic of China.
Methods Mol Biol. 2016;1375:123-36. doi: 10.1007/7651_2015_242.
Integrated analysis of large-scale transcriptomic and proteomic data can provide important insights into the metabolic mechanisms underlying complex biological systems. In this chapter, we present methods that address two issues in integrated transcriptomic and proteomic analysis. First, proteomic datasets are often incomplete, and integrated analysis of partial proteomic data may introduce significant bias. To address this issue, we describe a zero-inflated Poisson (ZIP)-based model that uncovers the complicated relationship between protein abundance and mRNA expression level, and then apply it to predict the abundance of proteins that were not detected experimentally. The ZIP model accounts for undetected proteins by assuming a probability mass at zero that represents expressed proteins left undetected owing to technical limitations. The validity of the model is demonstrated using biological information on operons, regulons, and pathways. Second, weak correlation between transcriptomic and proteomic datasets is often due to biological factors that affect translation. To quantify the effects of these factors, we describe a multiple regression-based statistical framework for examining how various sequence features related to translational efficiency affect mRNA-protein correlation. Using datasets from the sulfate-reducing bacterium Desulfovibrio vulgaris, the analysis shows that translation-related sequence features account for 15.2-26.2% of the total variation in the correlation between transcriptomic and proteomic datasets, and it also reveals the relative importance of the various features in the translation process.
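As an illustration of the zero-inflation idea described above, the following is a minimal sketch of a ZIP probability mass function in plain Python. It is not the chapter's implementation. The function names, the log link between mRNA level and the Poisson rate, and the coefficients `b0` and `b1` are assumptions made here for illustration only.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    lam: Poisson rate (expected abundance when the protein is detectable).
    pi:  extra probability mass at zero (protein expressed but not detected).
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero can come from non-detection or from a genuine Poisson zero.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def poisson_rate(mrna_level, b0, b1):
    """Hypothetical log link: the Poisson rate as a function of mRNA level."""
    return math.exp(b0 + b1 * mrna_level)


def prob_missed_given_zero(lam, pi):
    """Probability that an observed zero is due to the extra zero mass
    (non-detection) rather than a Poisson count of zero."""
    return pi / zip_pmf(0, lam, pi)
```

Under these assumptions, a protein with an observed count of zero but a high mRNA level gets a large `poisson_rate`, so `prob_missed_given_zero` approaches 1, and the rate itself can serve as the predicted abundance. In practice the parameters would be estimated from the detected proteins, for example by maximum likelihood.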
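The share of variation attributed to sequence features in the abstract can be illustrated with the coefficient of determination of a multiple regression. The sketch below fits ordinary least squares in plain Python and returns R^2. Comparing R^2 with and without a set of feature columns is one simple way to express their contribution. This design, the function names, and the column layout are assumptions for illustration, not the chapter's actual procedure.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of an ordinary least squares fit of y on the given predictor
    columns plus an intercept. `predictors` is a list of equal-length columns."""
    rows = [[1.0] + list(p) for p in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```

With a hypothetical mRNA column `mrna` and feature columns `f1`, `f2`, the difference `r_squared([mrna, f1, f2], protein) - r_squared([mrna], protein)` gives the additional fraction of variance explained once the features are included. The normal-equations approach is adequate for a small sketch; a numerically sturdier solver would be preferable for real datasets with many correlated features.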