Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan.
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo 135-0064, Japan.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab234.
Accurate variant effect prediction has broad impacts on protein engineering. Recent machine learning approaches toward this end are based on representation learning, by which feature vectors are learned and generated from unlabeled sequences. However, it is unclear how to effectively learn the evolutionary properties of an engineering target protein from homologous sequences while taking into account the protein's sequence-level structure, called domain architecture (DA). Additionally, no optimal protocols have been established for incorporating such properties into the Transformer, the neural network architecture known to perform best in natural language processing research. This article proposes DA-aware evolutionary fine-tuning, or 'evotuning', protocols for Transformer-based variant effect prediction, considering various combinations of homology search, fine-tuning and sequence vectorization strategies. We exhaustively evaluated our protocols on diverse proteins with different functions and DAs. The results indicated that our protocols achieved significantly better performance than previous DA-unaware ones. Visualizations of attention maps suggested that structural information was incorporated by evotuning without direct supervision, possibly leading to better prediction accuracy.
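For concreteness, the pipeline the abstract describes (masked-language-model fine-tuning on homologous sequences, followed by sequence vectorization) might be sketched as below. This is a minimal PyTorch illustration, not the authors' implementation: the model size, the `evotune` and `vectorize` helpers, and the mean-pooling vectorization are assumptions chosen for brevity, and the paper's DA-aware homology search step is not reproduced here.

```python
# Hypothetical sketch of "evotuning" + sequence vectorization; all names are
# illustrative and not taken from the paper.
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
MASK = len(AA)                # id of the special [MASK] token
VOCAB = len(AA) + 1

class TinyProteinLM(nn.Module):
    """A small Transformer encoder trained with masked-token prediction."""
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                # tokens: (batch, length)
        h = self.encoder(self.embed(tokens))  # (batch, length, d_model)
        return self.lm_head(h), h

def encode(seq):
    return torch.tensor([AA.index(a) for a in seq]).unsqueeze(0)

def evotune(model, homolog_sequences, epochs=3, mask_frac=0.15, lr=1e-4):
    """Fine-tune the language model on homologs of the target protein."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for seq in homolog_sequences:
            tokens = encode(seq)
            mask = torch.rand(tokens.shape) < mask_frac
            if not mask.any():
                continue
            logits, _ = model(tokens.masked_fill(mask, MASK))
            # Loss is computed on the masked positions only.
            loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
            opt.zero_grad()
            loss.backward()
            opt.step()

def vectorize(model, seq):
    """Mean-pool final hidden states into a fixed-length feature vector."""
    model.eval()
    with torch.no_grad():
        _, h = model(encode(seq))
    return h.mean(dim=1).squeeze(0)           # (d_model,)
```

In such a setup, `vectorize(model, mutant_sequence)` would yield a feature vector that could, for instance, feed a downstream regressor trained on measured variant effects.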