Barrios-Núñez Israel, Martínez-Redondo Gemma I, Medina-Burgos Patricia, Cases Ildefonso, Fernández Rosa, Rojas Ana M
Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain.
Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain.
NAR Genom Bioinform. 2024 Jul 2;6(3):lqae078. doi: 10.1093/nargab/lqae078. eCollection 2024 Sep.
Protein language models have been tested and shown to be reliable on curated datasets but have not yet been applied to full proteomes. Accordingly, we tested how two different machine learning-based methods performed when decoding functional information from the proteomes of selected model organisms. We found that protein language models are more precise and informative than deep learning methods for all the species tested and across the three gene ontologies studied, and that they better recover functional information from transcriptomic experiments. The results obtained indicate that these language models are likely to be suitable for large-scale annotation and downstream analyses, and we offer guidance for their use.