Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins.

Author information

Department of Industrial, Systems and Operations Engineering, Tempe, AZ 85287-5906, USA.

Publication information

Bioinformatics. 2009 Aug 1;25(15):1905-14. doi: 10.1093/bioinformatics/btp325. Epub 2009 May 15.

Abstract

MOTIVATION

Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Using partial proteomic data, it is likely that integrative transcriptomic and proteomic analysis may introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into metabolic mechanisms underlying complex biological systems.

RESULTS

In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect occurred in this model in the regions of high mRNA values and sparse data. Hence, we removed genes in those areas based on thresholds estimated from the partial dependence plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our non-linear model is built as a series of regression trees fitted sequentially, which gives it an implicit strength in variable selection. The model provides relative variable importance measures, using mean square error as the criterion. The results showed that coefficients of determination for our non-linear models ranged from 0.393 to 0.582 across both datasets, better than the linear regression used in the past.
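To make the modeling idea concrete, here is a minimal sketch of stochastic gradient boosting with one-split regression trees, written in plain Python. It is not the authors' implementation. The toy data, the function names (`fit_stump`, `fit_gbt`, `make_toy_data`), and the settings (200 trees, learning rate 0.1, 50% subsampling) are all assumptions chosen for illustration. The first feature stands in for mRNA abundance with a saturating effect on the response, and the second is pure noise. Relative importance is taken here as each feature's share of the total squared-error reduction.

```python
import math
import random


def fit_stump(X, resid, rows):
    """Find the one-split tree that most reduces squared error on `rows`."""
    n = len(rows)
    total = sum(resid[i] for i in rows)
    best = None  # (gain, feature, threshold, left_value, right_value)
    for j in range(len(X[0])):
        order = sorted(rows, key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += resid[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between identical values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors produced by this split.
            gain = left_sum**2 / k + right_sum**2 / (n - k) - total**2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best


def fit_gbt(X, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Boost stumps on residuals; each stump sees a random subsample of rows."""
    rng = random.Random(seed)
    n = len(y)
    f0 = sum(y) / n  # initial prediction: the mean response
    pred = [f0] * n
    stumps = []
    importance = [0.0] * len(X[0])
    for _ in range(n_trees):
        resid = [y[i] - pred[i] for i in range(n)]
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, resid, rows)
        if stump is None:
            break
        gain, j, thr, left, right = stump
        importance[j] += gain
        stumps.append((j, thr, learning_rate * left, learning_rate * right))
        for i in range(n):
            pred[i] += learning_rate * (left if X[i][j] <= thr else right)
    total = sum(importance) or 1.0
    return f0, stumps, [v / total for v in importance]


def predict(model, X):
    f0, stumps, _ = model
    return [f0 + sum(l if row[j] <= thr else r for j, thr, l, r in stumps)
            for row in X]


def r_squared(y, y_hat):
    """Coefficient of determination."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def make_toy_data(n, seed):
    """Hypothetical data: feature 0 has a saturating effect, feature 1 is noise."""
    rng = random.Random(seed)
    X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(n)]
    y = [2 * math.log1p(row[0]) + rng.gauss(0, 0.2) for row in X]
    return X, y


X_train, y_train = make_toy_data(200, seed=1)
X_test, y_test = make_toy_data(100, seed=2)
model = fit_gbt(X_train, y_train)
print("held-out R^2:", round(r_squared(y_test, predict(model, X_test)), 3))
print("relative importance [feature 0, feature 1]:",
      [round(v, 3) for v in model[2]])
```

Plotting the model's predictions against the first feature while holding the second fixed gives a rough analogue of a partial dependence curve, which is where a plateau at high values would show up.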
We evaluated the validity of this non-linear model using biological information of operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins.
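The validation idea can be sketched as a comparison between the average coefficient of variation (CV) inside biologically defined groups and the same quantity for randomly drawn groups of matching sizes. The gene names, abundance values, and function names below are hypothetical, and the resampling scheme is an assumption rather than the procedure used in the paper. The CV is only meaningful for strictly positive abundances, so values on a log scale would need a different dispersion measure.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be non-zero)."""
    return statistics.stdev(values) / statistics.mean(values)


def mean_group_cv(abundance, groups):
    """Average CV over all groups that have at least two members."""
    cvs = [coefficient_of_variation([abundance[g] for g in members])
           for members in groups if len(members) >= 2]
    return statistics.mean(cvs)


def random_group_cvs(abundance, groups, n_draws=1000, seed=0):
    """Mean CV for randomly drawn gene groups with the same sizes as `groups`."""
    rng = random.Random(seed)
    genes = sorted(abundance)
    draws = []
    for _ in range(n_draws):
        random_groups = [rng.sample(genes, len(members)) for members in groups]
        draws.append(mean_group_cv(abundance, random_groups))
    return draws


# Hypothetical predicted abundances for nine genes in three operons.
abundance = {"g1": 10.0, "g2": 11.0, "g3": 9.5,
             "g4": 50.0, "g5": 52.0, "g6": 48.0,
             "g7": 100.0, "g8": 105.0, "g9": 98.0}
operons = [["g1", "g2", "g3"], ["g4", "g5", "g6"], ["g7", "g8", "g9"]]

observed = mean_group_cv(abundance, operons)
draws = random_group_cvs(abundance, operons)
# Share of random groupings that are at least as tight as the real operons.
fraction_as_tight = sum(d <= observed for d in draws) / len(draws)
print("mean CV within operons:", round(observed, 3))
print("mean CV within random groups:", round(statistics.mean(draws), 3))
print("fraction of random draws as tight:", fraction_as_tight)
```

A small within-group CV relative to the random baseline is what one would expect if co-regulated genes receive similar predicted abundances.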

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
