ProtPlat: an efficient pre-training platform for protein classification based on FastText.

Affiliation

Department of Computer Science and Engineering, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China.

Publication information

BMC Bioinformatics. 2022 Feb 11;23(1):66. doi: 10.1186/s12859-022-04604-2.

Abstract

BACKGROUND

Over the past decades, the rapid growth of protein sequence data in public databases has driven the development of many machine learning methods that predict physicochemical properties or functions of proteins from amino acid sequence features. However, prediction performance often suffers from a lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample problem in computer vision and natural language processing, whereas few pre-training techniques have been designed specifically for protein sequences.

RESULTS

In this paper, we propose ProtPlat, a pre-training platform for representing protein sequences. It uses the Pfam database to train a three-layer neural network and then fine-tunes the model on the training data of each downstream task. ProtPlat learns good representations of amino acids while achieving efficient classification. We conduct experiments on three protein classification tasks: the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training effectively enhances model performance and that ProtPlat is competitive with state-of-the-art predictors, especially on small datasets. We implement the ProtPlat platform as a publicly accessible web service (https://compbio.sjtu.edu.cn/protplat).
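The supervised pre-training step described above needs labeled text input. As a minimal sketch, the following shows how protein sequences with family labels could be written in the plain-text format that FastText's supervised mode reads, where each line starts with a `__label__`-prefixed class followed by whitespace-separated tokens. The abstract does not say how sequences are tokenized, so treating every residue as one token is an assumption here, and the function names, toy sequences, and family identifiers are hypothetical.

```python
from typing import Iterable, Tuple


def to_fasttext_line(sequence: str, label: str) -> str:
    """Format one protein as a FastText supervised-training line.

    Assumption: each residue is one whitespace-separated token.
    The label uses FastText's default ``__label__`` prefix.
    """
    # Keep letters only, so stray spaces or digits in the input are dropped.
    residues = [aa for aa in sequence.upper() if aa.isalpha()]
    # Labels must not contain whitespace, so join any parts with underscores.
    safe_label = "_".join(label.split())
    return f"__label__{safe_label} " + " ".join(residues)


def write_training_file(records: Iterable[Tuple[str, str]], path: str) -> int:
    """Write (sequence, label) pairs to ``path``; return the number of lines."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for sequence, label in records:
            handle.write(to_fasttext_line(sequence, label) + "\n")
            count += 1
    return count


# Hypothetical toy records: (sequence, family label).
toy_records = [("MKTAYIAK", "PF00001"), ("mslne qr", "PF00069")]
lines = [to_fasttext_line(seq, fam) for seq, fam in toy_records]
```

A file written this way could then be passed to the `fasttext` Python package, for example through `fasttext.train_supervised(input=path)`, once with Pfam family labels for pre-training and again with downstream task labels. Whether ProtPlat follows exactly this procedure is not stated in the abstract.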

CONCLUSIONS

To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for pre-training on protein sequences. It features large-scale supervised training on the Pfam database and an efficient learning model, FastText. The experimental results on three downstream classification tasks demonstrate the efficacy of ProtPlat.
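To make the structure of a FastText-style classifier concrete, here is a minimal forward-pass sketch in plain Python: an embedding lookup for each residue, an average over the sequence as the hidden layer, and a linear output layer followed by softmax. It is an illustration under stated assumptions and not the authors' implementation. The embedding size, the random initialization, the three output classes, and all names are hypothetical, and FastText's n-gram features are omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def init_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(sequence, embeddings, output_weights):
    """Embedding lookup -> mean pooling -> linear layer with softmax."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    ids = [index[aa] for aa in sequence if aa in index]
    if not ids:
        raise ValueError("sequence contains no standard residues")
    dim = len(embeddings[0])
    # Hidden layer: the average of the residue embeddings.
    hidden = [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]
    # Output layer: one logit per class.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
embeddings = init_matrix(len(AMINO_ACIDS), 8, rng)  # assumed dimension: 8
output_weights = init_matrix(3, 8, rng)             # assumed: 3 classes
probs = forward("MKTAYIAK", embeddings, output_weights)
```

Because the hidden layer is a plain average, this stripped-down version ignores residue order. FastText can recover local order information through n-gram features, which the sketch leaves out for brevity.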


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/95db/8832758/6e65d235185a/12859_2022_4604_Fig1_HTML.jpg
