Toyota Technological Institute at Chicago, Chicago, IL 60637
Proc Natl Acad Sci U S A. 2019 Aug 20;116(34):16856-16865. doi: 10.1073/pnas.1821309116. Epub 2019 Aug 9.
Direct coupling analysis (DCA) for protein folding has made very good progress, but it is not effective for proteins that lack many sequence homologs, even when coupled with time-consuming fragment-based conformation sampling. We show that deep learning can accurately predict the interresidue distance distribution of a protein, even for proteins with ∼60 sequence homologs. Using only the geometric constraints given by the resulting distance matrix, we can construct 3D models without extensive conformation sampling. Our method successfully folded 21 of the 37 CASP12 hard targets, which have a median family size of 58 effective sequence homologs, within 4 h on a Linux computer with 20 central processing units. In contrast, DCA-predicted contacts cannot be used to fold any of these hard targets in the absence of extensive conformation sampling, and the best CASP12 group folded only 11 of them by integrating DCA-predicted contacts into fragment-based conformation sampling. Rigorous experimental validation in CASP13 shows that our distance-based folding server successfully folded 17 of 32 hard targets (with a median family size of 36 sequence homologs) and obtained 70% precision on the top L/5 long-range predicted contacts. The latest experimental validation in CAMEO shows that our server predicted correct folds for 2 membrane proteins while all of the other servers failed. These results demonstrate that it is now feasible to predict the correct fold for many more proteins that lack similar structures in the Protein Data Bank, even on a personal computer.
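The "precision on the top L/5 long-range predicted contacts" figure quoted above can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes the conventional CASP definitions, which the abstract does not state: two residues are in contact when their native distance is below 8 Å, and a pair is long-range when the residues are at least 24 positions apart in sequence. The function name, parameter names, and toy data are hypothetical.

```python
def top_l5_long_range_precision(prob, dist, min_sep=24, cutoff=8.0):
    """Precision of the top L/5 long-range predicted contacts.

    prob[i][j]: predicted contact probability for residues i and j.
    dist[i][j]: native distance (in angstroms) between residues i and j.
    min_sep and cutoff are assumed conventions, not values from the abstract.
    """
    n = len(prob)  # L, the sequence length
    # Collect residue pairs at least `min_sep` positions apart in sequence.
    pairs = [(prob[i][j], i, j)
             for i in range(n) for j in range(i + min_sep, n)]
    # Rank by predicted probability, highest first, and keep the top L/5.
    pairs.sort(key=lambda t: t[0], reverse=True)
    top = pairs[:max(1, n // 5)]
    if not top:
        return 0.0
    # A prediction is correct when the native distance is below the cutoff.
    hits = sum(1 for _, i, j in top if dist[i][j] < cutoff)
    return hits / len(top)


# Toy example with L = 30, so the top L/5 = 6 long-range pairs are scored.
L = 30
prob = [[0.0] * L for _ in range(L)]
dist = [[20.0] * L for _ in range(L)]
predicted = {(0, 24): 0.9, (0, 25): 0.8, (1, 26): 0.7,
             (2, 27): 0.6, (3, 28): 0.5, (4, 29): 0.4}
for (i, j), p in predicted.items():
    prob[i][j] = p
for (i, j), d in {(0, 24): 5.0, (1, 26): 6.0, (3, 28): 7.5}.items():
    dist[i][j] = d
prob[0][1] = 0.99  # short-range pair, ignored by the long-range filter
print(top_l5_long_range_precision(prob, dist))  # 0.5
```

In this toy case three of the six top-ranked long-range pairs are true contacts, so the precision is 0.5. The reported 70% corresponds to the same ratio computed on the CASP13 hard targets.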