Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks.

Affiliation

Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.

Publication information

Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac232.

DOI: 10.1093/bib/bbac232
PMID: 35758229

Abstract

A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space that can efficiently encode high-dimensional data. Recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), using supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task of special interest is automatic function prediction for the huge number of novel proteins that are being discovered and remain completely uncharacterized. However, despite its importance, to date there has been no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and on concrete, common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods to accelerate the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them in an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains; and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach in light of the benchmark results. We hope the results and discussion of this study can help the community select the most suitable machine learning-based technique for protein representation according to the bioinformatics task at hand.
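To make task (i) concrete, the following is a minimal sketch of comparing proteins in an embedding space. It is an illustration only, not the benchmark procedure used in the study: the choice of cosine similarity, the function names `cosine_similarity` and `nearest_neighbor`, the protein identifiers, and the 4-dimensional toy vectors are all assumptions made for this example. Real learned embeddings would come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)


def nearest_neighbor(query, reference):
    """Return (protein_id, similarity) for the reference embedding most similar to query."""
    scored = ((pid, cosine_similarity(query, emb)) for pid, emb in reference.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings (made-up values, not output of any real model).
reference = {
    "protein_A": [0.9, 0.1, 0.0, 0.2],
    "protein_B": [0.0, 0.8, 0.6, 0.1],
}
query = [0.8, 0.2, 0.1, 0.3]

best_id, best_sim = nearest_neighbor(query, reference)
print(best_id, round(best_sim, 3))
```

With these toy values the query lies closest to `protein_A`. In a downstream setting, the annotation of the nearest characterized protein could serve as a simple baseline prediction for an uncharacterized one, although whether that works well depends on the representation, which is the kind of question a benchmark is meant to answer.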
