Zhang Chengxin, Liu Quancheng, Freddolino Lydia
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
Gilbert S Omenn Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48197, USA.
bioRxiv. 2025 Jul 14:2025.07.09.663945. doi: 10.1101/2025.07.09.663945.
Features extracted from sequence homologs significantly improve the accuracy of deep learning-based protein structure prediction. For example, models such as AlphaFold, which extract features from sequence homologs, generally produce more accurate protein structures than single-sequence methods such as ESMFold. In contrast, features from sequence homologs are seldom used in deep learning-based protein function prediction. Although a small number of models incorporate function labels from sequence homologs, they cannot use features extracted from homologs that lack function labels. To address this gap, we propose EZpred, the first deep learning model to use unlabeled sequence homologs for protein function prediction. Starting from the target sequence and its homologs identified by MMseqs2, EZpred extracts sequence features using the ESMC protein language model. These features are then fed into a deep learning model to predict the Enzyme Commission (EC) numbers of the target protein. For 753 enzymes, the F1-score of EZpred's EC number prediction is 4% higher than that of a similar model that does not use sequence homologs and at least 10% higher than those of state-of-the-art EC number prediction models. These results demonstrate the strong positive impact of sequence homologs on deep learning-based enzyme function prediction.
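The F1-score comparison above can be made concrete with a small sketch. The abstract does not say which F1 variant is reported, so the micro-averaged form over (protein, EC number) pairs used here is an assumption, and the function name `micro_f1`, the protein identifiers, and the EC labels are hypothetical toy values rather than data from the study.

```python
from typing import Dict, Set


def micro_f1(predicted: Dict[str, Set[str]], actual: Dict[str, Set[str]]) -> float:
    """Micro-averaged F1 over (protein, EC number) pairs.

    Assumption: the averaging scheme is illustrative only; the abstract
    does not specify how its F1-score is computed.
    """
    tp = fp = fn = 0
    # Visit every protein that appears in either mapping.
    for protein in actual.keys() | predicted.keys():
        pred = predicted.get(protein, set())
        true = actual.get(protein, set())
        tp += len(pred & true)   # EC numbers predicted and correct
        fp += len(pred - true)   # EC numbers predicted but wrong
        fn += len(true - pred)   # EC numbers missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): two proteins, one of them with two EC numbers.
actual = {"P1": {"1.1.1.1"}, "P2": {"2.7.11.1", "3.1.3.16"}}
predicted = {"P1": {"1.1.1.1"}, "P2": {"2.7.11.1", "4.2.1.1"}}
score = micro_f1(predicted, actual)
```

In this toy case two predicted EC numbers are correct, one is spurious, and one is missed, so precision and recall are both 2/3 and the F1-score is 2/3.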