Ojewole Adegoke A, Jou Jonathan D, Fowler Vance G, Donald Bruce R
1 Department of Computer Science, Duke University , Durham, North Carolina.
2 Computational Biology and Bioinformatics Program, Duke University , Durham, North Carolina.
J Comput Biol. 2018 Jul;25(7):726-739. doi: 10.1089/cmb.2017.0267. Epub 2018 Mar 13.
Computational protein design (CPD) algorithms that compute binding affinity, K, search for sequences with an energetically favorable free energy of binding. Recent work shows that three principles improve the biological accuracy of CPD: ensemble-based design, continuous flexibility of backbone and side-chain conformations, and provable guarantees of accuracy with respect to the input. However, previous methods that use all three design principles are single-sequence (SS) algorithms, which are very costly: linear in the number of sequences and thus exponential in the number of simultaneously mutable residues. To address this computational challenge, we introduce BBK*, a new CPD algorithm whose key innovation is the multisequence (MS) bound: BBK* efficiently computes a single provable upper bound to approximate K for a combinatorial number of sequences, and avoids SS computation for all provably suboptimal sequences. Thus, to our knowledge, BBK* is the first provable, ensemble-based CPD algorithm to run in time sublinear in the number of sequences. Computational experiments on 204 protein design problems show that BBK* finds the tightest binding sequences while approximating K for up to 10-fold fewer sequences than the previous state-of-the-art algorithms, which require exhaustive enumeration of sequences. Furthermore, for 51 protein-ligand design problems, BBK* provably approximates K up to 1982-fold faster than the previous state-of-the-art iMinDEE/[Formula: see text]/[Formula: see text] algorithm. Therefore, BBK* not only accelerates protein designs that are possible with previous provable algorithms, but also efficiently performs designs that are too large for previous methods.
计算蛋白质设计(CPD)算法通过计算结合亲和力K来寻找具有能量有利结合自由能的序列。最近的研究表明,有三个原则可以提高CPD的生物学准确性:基于系综的设计、主链和侧链构象的连续灵活性以及关于输入的可证明的准确性保证。然而,之前使用所有这三个设计原则的方法都是单序列(SS)算法,成本非常高:序列数量呈线性,因此对于同时可变残基的数量呈指数级。为了应对这一计算挑战,我们引入了BBK*,一种新的CPD算法,其关键创新在于多序列(MS)界:BBK能高效地为组合数量的序列计算一个可证明的单一上界来近似K,并避免对所有可证明的次优序列进行SS计算。因此,据我们所知,BBK是第一个可证明的、基于系综的CPD算法,其运行时间在序列数量上是次线性的。对204个蛋白质设计问题的计算实验表明,BBK能找到结合最紧密的序列,同时近似K时所需的序列数量比之前的最优算法少多达10倍,而之前的算法需要对序列进行穷举。此外,对于51个蛋白质 - 配体设计问题,BBK可证明地近似K的速度比之前的最优iMinDEE/[公式:见正文]/[公式:见正文]算法快多达1982倍。因此,BBK*不仅加速了之前可证明算法所能实现的蛋白质设计,还能高效地执行对于之前方法来说规模太大的设计。