Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks.

Affiliation

Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.

Publication information

Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac232.

DOI:10.1093/bib/bbac232

PMID:35758229

Abstract

A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered and remain entirely uncharacterized. However, despite its importance, to date there has been no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for concrete, common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods to accelerate the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains; and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach in light of the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
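As an illustration of task (i), the following minimal sketch compares proteins in an embedding space. It rests on several assumptions not stated in the abstract: cosine similarity is used as the similarity measure (the study may use a different metric), and the protein identifiers and three-dimensional vectors are invented toy values rather than outputs of any real representation method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_neighbor(query_id, embeddings):
    """Return (protein_id, similarity) of the protein closest to the query."""
    query = embeddings[query_id]
    best_id, best_score = None, -math.inf
    for protein_id, vector in embeddings.items():
        if protein_id == query_id:
            continue  # skip the query itself
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_id, best_score = protein_id, score
    return best_id, best_score


# Hypothetical toy embeddings; real methods typically produce vectors
# with hundreds or thousands of dimensions.
toy_embeddings = {
    "protA": [0.9, 0.1, 0.0],
    "protB": [0.8, 0.2, 0.1],
    "protC": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    neighbor, score = nearest_neighbor("protA", toy_embeddings)
    print(neighbor, round(score, 3))
```

In a benchmark of this kind, the neighbors found in the embedding space would then be checked against an independent notion of similarity, such as sequence alignment or shared annotations, to judge how well a representation preserves biological relatedness.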
