Ochoteco Asensio Juan, Verheijen Marcha, Caiment Florian
Department of Toxicogenomics, School of Oncology and Developmental Biology (GROW), Maastricht University, Maastricht, The Netherlands.
Comput Struct Biotechnol J. 2022 Apr 22;20:2057-2069. doi: 10.1016/j.csbj.2022.04.017. eCollection 2022.
Proteins are often considered the main biological elements responsible for the functions and structures of a cell. However, proteomics, the global study of all expressed proteins, is usually performed by mass spectrometry, which is limited by stochastic sampling and can quantify only a limited number of proteins per sample. Transcriptomics, which allows an exhaustive analysis of all expressed transcripts, is often used as a surrogate. However, transcript levels do not correlate strongly with the corresponding protein levels, notably because of several post-transcriptional regulatory mechanisms. In this publication, we hypothesize that the missing protein values in proteomics can be predicted using machine learning regression methods trained on many features extracted from transcriptomics, including known translational regulatory elements such as microRNAs and circular RNAs. After evaluating different machine learning algorithms under two different data-splitting strategies, we report that random forest can predict protein levels in new samples from transcriptomics data with good accuracy. The pre-processing and model-building scripts are available on GitHub: https://github.com/jochotecoa/ml_proteomics.
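The abstract does not say what the two splitting strategies are. As an illustration only, the sketch below assumes that one of them holds out whole samples, which is what evaluating predictions in "new samples" would require. The row layout, the field names (`sample`, `protein`, `mrna`, `protein_level`) and the function `split_by_sample` are hypothetical and are not taken from the linked repository.

```python
import random
from collections import defaultdict


def split_by_sample(rows, test_fraction=0.25, seed=0):
    """Split rows so that every sample lands wholly in train or in test.

    Each row is a dict with a "sample" key (assumed layout: one row per
    protein measurement in one sample, alongside its transcriptomic features).
    """
    # Group rows by the sample they were measured in.
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["sample"]].append(row)

    # Shuffle sample identifiers reproducibly, then reserve a fraction for testing.
    samples = sorted(by_sample)
    random.Random(seed).shuffle(samples)
    n_test = max(1, round(len(samples) * test_fraction))
    test_samples = set(samples[:n_test])

    train = [r for s in samples if s not in test_samples for r in by_sample[s]]
    test = [r for s in samples if s in test_samples for r in by_sample[s]]
    return train, test


# Toy data with made-up values: 8 samples x 3 proteins.
rows = [
    {"sample": f"S{i}", "protein": p, "mrna": float(i + j), "protein_level": float(i * j)}
    for i in range(8)
    for j, p in enumerate(["P1", "P2", "P3"])
]
train_rows, test_rows = split_by_sample(rows)
train_ids = {r["sample"] for r in train_rows}
test_ids = {r["sample"] for r in test_rows}
```

A regression model such as a random forest would then be fitted on `train_rows` and scored on `test_rows`. Because no sample appears in both sets, the score reflects performance on samples the model has not seen.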