The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii95-ii98. doi: 10.1093/bioinformatics/btac474.
Recently, deep learning models initially developed in the field of natural language processing (NLP) have been applied successfully to the analysis of protein sequences. A major drawback of these models is their size, both in the number of parameters that must be fitted and in the computational resources they require. Recently, 'distilled' models built on the concept of student and teacher networks have become widely used in NLP. Here, we adapted this concept to the problem of protein sequence analysis by developing DistilProtBert, a distilled version of the successful ProtBert model. With this approach, we reduced the size of the network and the running time by 50%, and the computational resources needed for pretraining by 98%, relative to the ProtBert model. Using two published tasks, we showed that the performance of the distilled model approaches that of the full model. We next tested the ability of DistilProtBert to distinguish between real and random protein sequences. This task is highly challenging when the amino acid composition is maintained at the level of singlets, doublets and triplets; indeed, traditional machine-learning algorithms struggle with it. Here, we show that DistilProtBert performs very well on singlet-, doublet- and even triplet-shuffled versions of the human proteome, with AUCs of 0.92, 0.91 and 0.87, respectively. Finally, we suggest that by examining the small number of false-positive classifications (i.e. shuffled sequences classified as proteins by DistilProtBert), we may be able to identify de novo potential natural-like proteins based on random shuffling of amino acid sequences.
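To make the shuffled-sequence setup concrete, the sketch below shows a singlet (k=1) shuffle, which preserves single-amino-acid composition exactly, and a helper for checking k-mer composition (k=2 for doublets, k=3 for triplets). The function names and the example sequence are illustrative, not from the paper; preserving doublet or triplet composition under shuffling requires a more involved Euler-path shuffle (e.g. the Altschul-Erickson method) and is not shown here.

```python
import random
from collections import Counter

def singlet_shuffle(seq: str, rng=random) -> str:
    # Randomly permute residues; the single-amino-acid (singlet)
    # composition of the sequence is preserved exactly.
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def kmer_counts(seq: str, k: int) -> Counter:
    # k-mer composition: k=1 singlets, k=2 doublets, k=3 triplets.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Illustrative sequence (not from the human proteome data used in the paper).
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
shuf = singlet_shuffle(seq)

# A singlet shuffle keeps the singlet composition identical, but in general
# changes doublet and triplet composition, which is what makes the
# doublet/triplet-preserving variants of the task harder.
assert kmer_counts(shuf, 1) == kmer_counts(seq, 1)
```

A classifier such as DistilProtBert is then asked to label each sequence as real or shuffled; the harder negatives are those where the k=2 or k=3 composition is also held fixed.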