A Semi-Supervised Autoencoder-Based Approach for Protein Function Prediction.

Publication information

IEEE J Biomed Health Inform. 2022 Oct;26(10):4957-4965. doi: 10.1109/JBHI.2022.3163150. Epub 2022 Oct 4.

Abstract

With the development of next-generation sequencing techniques, protein sequences have become abundantly available. Determining the functional characteristics of these proteins is costly and time-consuming, so the gap between the number of known protein sequences and the number with annotated functions keeps widening. Machine-learning methods are increasingly used to close this gap. This work proposes a deep-learning approach that predicts protein function from protein sequences. A set of autoencoders is trained in a semi-supervised manner on protein sequences, with each autoencoder corresponding to exactly one protein function. In particular, 932 autoencoders corresponding to 932 biological processes and 585 autoencoders corresponding to 585 molecular functions are trained separately. The reconstruction losses that each protein sample yields across all autoencoders are used as features to classify the sequence into its corresponding functions. The proposed model is evaluated on test protein samples and achieves promising results. The method can be readily extended to any number of functions for which an ample number of supporting protein sequences is available. All relevant code, data, and trained models are available at https://github.com/richadhanuka/PFP-Autoencoders.
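
The core idea in the abstract (one autoencoder per function, with per-autoencoder reconstruction losses used as classification features) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear autoencoder fitted in closed form via SVD in place of the deep autoencoders, synthetic vectors in place of encoded protein sequences, and a simple lowest-loss rule in place of the downstream classifier. All function names and sizes below are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X, n_components):
    """Fit a linear autoencoder in closed form (equivalent to PCA).

    Assumption: a stand-in for the deep autoencoders described in the abstract.
    """
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_loss(model, X):
    """Mean squared reconstruction error of each row of X under one model."""
    mean, components = model
    centred = X - mean
    reconstructed = centred @ components.T @ components
    return ((centred - reconstructed) ** 2).mean(axis=1)

def loss_features(models, X):
    """Stack per-autoencoder losses into an (n_samples, n_functions) matrix."""
    return np.column_stack([reconstruction_loss(m, X) for m in models])

rng = np.random.default_rng(0)
n_features, n_functions = 20, 3  # toy sizes; the paper trains 932 and 585 models

# Synthetic stand-in for encoded protein sequences: each hypothetical function
# occupies its own 2-dimensional subspace, plus a little noise.
bases = [rng.normal(size=(2, n_features)) for _ in range(n_functions)]

def sample(function_id, n):
    latent = rng.normal(size=(n, 2))
    noise = 0.05 * rng.normal(size=(n, n_features))
    return latent @ bases[function_id] + noise

# One autoencoder per function, trained only on that function's samples.
models = [fit_linear_autoencoder(sample(k, 200), n_components=2)
          for k in range(n_functions)]

# Separate test samples with known labels.
test_labels = np.repeat(np.arange(n_functions), 50)
X_test = np.vstack([sample(k, 50) for k in range(n_functions)])

# Each test sample is described by its loss under every autoencoder.
features = loss_features(models, X_test)

# Simplest possible decision rule (an assumption): pick the lowest loss.
predicted = features.argmin(axis=1)
accuracy = (predicted == test_labels).mean()
print(features.shape, accuracy)
```

A sample tends to be reconstructed well only by the autoencoder trained on its own function, so the loss vector carries the class signal. Real protein function prediction is multi-label, so a practical pipeline would feed the full loss vector to a classifier rather than take a single minimum.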
