Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706.
Morgridge Institute for Research, Madison, WI 53715.
Proc Natl Acad Sci U S A. 2021 Nov 30;118(48). doi: 10.1073/pnas.2104878118.
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.
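The abstract's point about networks that capture nonlinear interactions and share parameters across sequence positions can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`one_hot`, `conv_forward`), the kernel width, the number of filters, and the example sequence are all hypothetical, and the weights are random rather than trained on deep mutational scanning data. It only shows how a one-hot encoded variant might pass through a single convolutional layer whose filters are reused at every window of the sequence.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[aa] for aa in seq]] = 1.0
    return x


def conv_forward(x, kernel, bias, w_out):
    """Toy 1D convolutional scorer (hypothetical, untrained).

    kernel: (k, 20, F) filter bank applied to every window of k residues,
            so the same parameters are shared across sequence positions.
    bias:   (F,) per-filter bias.
    w_out:  (F,) linear readout mapping pooled features to a scalar score.
    """
    length, k = x.shape[0], kernel.shape[0]
    # Collect every window of k consecutive residues: (length-k+1, k, 20).
    windows = np.stack([x[i:i + k] for i in range(length - k + 1)])
    # Apply the shared filters to each window: (length-k+1, F).
    hidden = np.einsum("wka,kaf->wf", windows, kernel) + bias
    # ReLU nonlinearity lets the model represent non-additive effects.
    hidden = np.maximum(hidden, 0.0)
    # Sum-pool over positions, then a linear readout to one score.
    return float(hidden.sum(axis=0) @ w_out)


# Assumed toy setup: random weights and an arbitrary 10-residue sequence.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 20, 4))
bias = np.zeros(4)
w_out = rng.normal(size=4)
score = conv_forward(one_hot("MKTAYIAKQR"), kernel, bias, w_out)
print(score)
```

In a supervised setting of the kind the abstract describes, the kernel, bias, and readout weights would be fit to measured functional scores instead of drawn at random; the sketch covers only the forward pass.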