Predicting missing proteomics values using machine learning: Filling the gap using transcriptomics and other biological features.

Authors

Ochoteco Asensio Juan, Verheijen Marcha, Caiment Florian

Affiliations

Department of Toxicogenomics, School of Oncology and Developmental Biology (GROW), Maastricht University, Maastricht, The Netherlands.

Publication information

Comput Struct Biotechnol J. 2022 Apr 22;20:2057-2069. doi: 10.1016/j.csbj.2022.04.017. eCollection 2022.

DOI:10.1016/j.csbj.2022.04.017

PMID:35601960

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9077535/

Abstract

Proteins are often considered the main biological element in charge of the different functions and structures of a cell. However, proteomics, the global study of all expressed proteins, often performed by mass spectrometry, is limited by its stochastic sampling and can only quantify a limited amount of protein per sample. Transcriptomics, which allows an exhaustive analysis of all expressed transcripts, is often used as a surrogate. However, the transcript level does not present a high level of correlation with the corresponding protein level, notably due to the existence of several post-transcriptional regulatory mechanisms. In this publication, we hypothesize that the missing protein values in proteomics could be predicted using machine learning regression methods, trained with many features extracted from transcriptomics, including known translational regulatory elements such as microRNAs and circular RNAs. After considering different machine learning algorithms applied on two different splitting strategies, we report that random forest can predict proteins in new samples out of transcriptomics data with good accuracy. The proposed pre-processing and model building scripts can be accessed on GitHub: https://github.com/jochotecoa/ml_proteomics.
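The abstract mentions two splitting strategies without naming them, and it reports predictions for proteins in new samples. As an illustration only, the sketch below contrasts a row-wise random split with a split that holds out whole samples, which is one way to test prediction on unseen samples. The two strategies shown, the function names, the column names, and the toy values are all assumptions and are not taken from the authors' scripts.

```python
import random


def random_split(rows, test_fraction=0.25, seed=0):
    """Row-wise split: rows from the same sample may land in both sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sample_holdout_split(rows, test_fraction=0.25, seed=0):
    """Group-wise split: every row of a held-out sample goes to the test set."""
    rng = random.Random(seed)
    samples = sorted({r["sample"] for r in rows})
    rng.shuffle(samples)
    n_test = max(1, int(round(len(samples) * test_fraction)))
    test_samples = set(samples[:n_test])
    train = [r for r in rows if r["sample"] not in test_samples]
    test = [r for r in rows if r["sample"] in test_samples]
    return train, test


# Toy table (made-up values): one row per (sample, protein) pair.
rows = [
    {"sample": f"S{s}", "protein": f"P{p}",
     "mrna": float(s + p), "protein_level": float(s * p)}
    for s in range(1, 5)
    for p in range(1, 4)
]

train_r, test_r = random_split(rows)
train_g, test_g = sample_holdout_split(rows)

# With the grouped split, no sample contributes rows to both sets.
shared_grouped = {r["sample"] for r in train_g} & {r["sample"] for r in test_g}
print("rows (random split):", len(train_r), len(test_r))
print("rows (grouped split):", len(train_g), len(test_g))
print("samples shared across grouped split:", len(shared_grouped))
```

A regression model such as a random forest would then be fit on the training rows and scored on the test rows. Under the grouped split, the score reflects performance on samples the model has never seen.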


