Department of Biochemistry, University of Washington, Seattle, WA, USA.
Institute for Protein Design, University of Washington, Seattle, WA, USA.
Nature. 2021 Dec;600(7889):547-552. doi: 10.1038/s41586-021-04184-w. Epub 2021 Dec 1.
There has been considerable recent progress in protein structure prediction using deep neural networks to predict inter-residue distances from amino acid sequences. Here we investigate whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generate random amino acid sequences, and input them into the trRosetta structure prediction network to predict starting residue-residue distance maps, which, as expected, are quite featureless. We then carry out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (Kullback-Leibler divergence) between the inter-residue distance distributions predicted by the network and background distributions averaged over all proteins. Optimization from different random starting points resulted in novel proteins spanning a wide range of sequences and predicted structures. We obtained synthetic genes encoding 129 of the network-'hallucinated' sequences, and expressed and purified the proteins in Escherichia coli; 27 of the proteins yielded monodisperse species with circular dichroism spectra consistent with the hallucinated structures. We determined the three-dimensional structures of three of the hallucinated proteins, two by X-ray crystallography and one by NMR, and these closely matched the hallucinated models. Thus, deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute alongside traditional physics-based models to the de novo design of proteins with new functions.
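The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the trRosetta network (it looks up a fixed random table of per-pair distance distributions), the background is simply that table's average, and the single-residue mutation move, Metropolis acceptance rule, temperature, and step count are all assumptions chosen for readability.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_BINS = 8  # number of distance bins (illustrative choice)


def softmax(x):
    """Softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical toy model: one fixed distance distribution per residue-type pair.
_w = np.random.default_rng(0).normal(size=(20, 20, N_BINS))
_PAIR_PROBS = softmax(_w + _w.transpose(1, 0, 2))  # symmetric in (i, j)
# Background: the toy model's distribution averaged over all residue-type pairs.
BACKGROUND = _PAIR_PROBS.mean(axis=(0, 1))


def toy_predictor(seq):
    """Stand-in for a structure prediction network.

    seq: integer array of shape (L,). Returns distance distributions
    of shape (L, L, N_BINS).
    """
    return _PAIR_PROBS[seq[:, None], seq[None, :]]


def kl_contrast(pred, background, eps=1e-9):
    """Mean KL(pred || background) over all residue pairs."""
    kl = np.sum(pred * (np.log(pred + eps) - np.log(background + eps)), axis=-1)
    return float(kl.mean())


def hallucinate(length=20, steps=200, temperature=0.01, seed=0):
    """Metropolis Monte Carlo in sequence space, maximizing the KL contrast.

    Returns (best_sequence, starting_score, best_score).
    """
    rng = np.random.default_rng(seed)
    seq = rng.integers(0, 20, size=length)  # random starting sequence
    score = kl_contrast(toy_predictor(seq), BACKGROUND)
    start_score = score
    best_seq, best_score = seq.copy(), score

    for _ in range(steps):
        cand = seq.copy()
        cand[rng.integers(length)] = rng.integers(20)  # mutate one position
        cand_score = kl_contrast(toy_predictor(cand), BACKGROUND)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        accept = cand_score >= score or rng.random() < np.exp(
            (cand_score - score) / temperature
        )
        if accept:
            seq, score = cand, cand_score
            if score > best_score:
                best_seq, best_score = seq.copy(), score

    return "".join(AMINO_ACIDS[i] for i in best_seq), start_score, best_score
```

Calling `hallucinate()` returns the best sequence found together with the starting and best contrast scores. In a real setting the toy lookup table would be replaced by a trained network's predicted distance distributions, and each step would cost a full forward pass.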