School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK.
Bioinformatics. 2013 Dec 1;29(23):3060-6. doi: 10.1093/bioinformatics/btt537. Epub 2013 Sep 16.
Much dynamic cellular behaviour is achieved through accurate regulation of protein concentrations. Nevertheless, messenger RNA abundances, measured by microarray technology and more recently by deep sequencing, are widely used as proxies for protein measurements. For some species and under some conditions there is good correlation between transcriptome-level and proteome-level measurements, but such correlation is by no means universal, owing to post-transcriptional and post-translational regulation, both of which are highly prevalent in cells. Here, we seek to develop a data-driven machine learning approach to bridge the gap between these two levels of high-throughput omic measurements in Saccharomyces cerevisiae, and we deploy the model in a novel way to uncover mRNA-protein pairs that are candidates for post-translational regulation.
Feature selection by sparsity-inducing regression (l₁-norm regularization) yields a stable set of five features, namely mRNA, ribosomal occupancy, ribosome density, tRNA adaptation index and codon bias, reducing the feature set from 37 to 5. A linear predictor using these features predicts protein concentrations fairly accurately (R² = 0.86). Proteins whose concentrations cannot be predicted accurately, treated as outliers with respect to the predictor, are shown to have annotation evidence of post-translational modification significantly more often than random subsets of similar size (P < 0.02). In a data mining sense, this work also makes a wider point: outliers with respect to a learning method can carry meaningful information about a problem domain.
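The pipeline described above (l₁-regularized feature selection, a linear predictor on the selected features, then flagging poorly predicted proteins as outliers) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the l₁ problem is solved with a plain proximal-gradient (ISTA) loop, and the regularization strength `lam`, the outlier rule (residuals beyond `k` robust standard deviations) and all function names are assumptions made for illustration only.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size = 1 / Lipschitz constant of the gradient of the smooth part.
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y) / n)
        # Soft-thresholding sets small coefficients exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w


def fit_and_flag(X, y, lam=0.1, k=3.0):
    """Select features with the lasso, refit by least squares, flag outliers.

    `lam` and `k` are illustrative choices, not values from the paper.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise features
    w = lasso_ista(Xs, y - y.mean(), lam)
    selected = np.flatnonzero(w)                   # features with non-zero weight

    # Ordinary least squares (with intercept) on the selected features only.
    A = np.column_stack([np.ones(len(y)), Xs[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

    # Robust residual scale from the median absolute deviation (MAD).
    centred = resid - np.median(resid)
    scale = 1.4826 * np.median(np.abs(centred))
    outliers = np.flatnonzero(np.abs(centred) > k * scale)
    return selected, r2, outliers


def demo(seed=0):
    """Synthetic stand-in: 37 features, 5 informative, 20 planted outliers."""
    rng = np.random.default_rng(seed)
    n, d = 1000, 37
    X = rng.normal(size=(n, d))
    true_w = np.zeros(d)
    true_w[:5] = [1.0, 0.8, -0.6, 0.5, 0.7]
    y = X @ true_w + rng.normal(scale=0.3, size=n)
    planted = rng.choice(n, size=20, replace=False)
    y[planted] += 3.0                              # shift not explained by X
    selected, r2, outliers = fit_and_flag(X, y)
    return selected, r2, outliers, planted


if __name__ == "__main__":
    selected, r2, outliers, planted = demo()
    print("selected features:", selected)
    print("R^2 of refit linear predictor:", round(r2, 3))
    print("planted outliers recovered:", set(planted.tolist()) <= set(outliers.tolist()))
```

In this toy setting the planted samples play the role of proteins whose concentration is driven by something the features do not capture. In the study itself, the flagged proteins were then compared against annotation evidence of post-translational modification, a step this sketch does not attempt.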