Decoding functional proteome information in model organisms using protein language models.

Author information

Barrios-Núñez Israel, Martínez-Redondo Gemma I, Medina-Burgos Patricia, Cases Ildefonso, Fernández Rosa, Rojas Ana M

Affiliations

Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain.

Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain.

Publication information

NAR Genom Bioinform. 2024 Jul 2;6(3):lqae078. doi: 10.1093/nargab/lqae078. eCollection 2024 Sep.

DOI:10.1093/nargab/lqae078

PMID:38962255

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11217674/

Abstract

Protein language models have been tested and proved to be reliable when used on curated datasets but have not yet been applied to full proteomes. Accordingly, we tested how two different machine learning-based methods performed when decoding functional information from the proteomes of selected model organisms. We found that protein language models are more precise and informative than deep learning methods for all the species tested and across the three gene ontologies studied, and that they better recover functional information from transcriptomic experiments. The results obtained indicate that these language models are likely to be suitable for large-scale annotation and downstream analyses, and we recommend a guide for their use.
