Department of Genetics.
Department of Bioengineering.
Bioinformatics. 2020 Jun 1;36(11):3357-3364. doi: 10.1093/bioinformatics/btaa162.
High-throughput protein screening is a critical technique for dissecting and designing protein function. Libraries for these assays can be created through a number of means, including targeted or random mutagenesis of a template protein sequence or direct DNA synthesis. However, mutagenic library construction methods often yield vastly more nonfunctional than functional variants and, despite advances in large-scale DNA synthesis, individual synthesis of each desired DNA template is often prohibitively expensive. Consequently, many protein-screening libraries rely on the use of degenerate codons (DCs), mixtures of DNA bases incorporated at specific positions during DNA synthesis, to generate highly diverse protein-variant pools from only a few low-cost synthesis reactions. However, selecting DCs for sets of sequences that covary at multiple positions dramatically increases the difficulty of designing a DC library and leads to the creation of many undesired variants that can quickly outstrip screening capacity.
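The effect of covariation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of DeCoDe: the helper names (`expand`, `encoded`, `library`) and the two-residue target set are hypothetical, and only the IUPAC nucleotide codes and the standard genetic code are taken as given.

```python
from itertools import product

# IUPAC nucleotide codes: each symbol stands for a mixture of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def expand(dc):
    """Return every concrete codon represented by a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in dc))]


def encoded(dc):
    """Return the set of amino acids (or '*') a degenerate codon encodes."""
    return {CODON_TABLE[codon] for codon in expand(dc)}


def library(dcs):
    """Return every protein variant encoded by one DC per position."""
    return {"".join(p) for p in product(*(sorted(encoded(dc)) for dc in dcs))}


# Hypothetical targets that covary at two positions: A pairs with V, G with L.
targets = {"AV", "GL"}
variants = library(["GST", "STT"])   # GST encodes A/G, STT encodes L/V
undesired = variants - targets       # combinations nobody asked for

print(sorted(variants))
print(sorted(undesired))
```

Even in this two-position toy case, half of the encoded variants are undesired, because each degenerate codon is chosen per position and cannot express that A goes only with V. With more covarying positions the number of unwanted combinations multiplies.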
We introduce a novel algorithm for total DC library optimization, degenerate codon design (DeCoDe), based on integer linear programming. DeCoDe significantly outperforms state-of-the-art DC optimization algorithms and scales well to more than a hundred proteins sharing complex patterns of covariation (e.g. the lab-derived avGFP lineage). Moreover, DeCoDe is, to our knowledge, the first DC design algorithm with the capability to encode mixed-length protein libraries. We anticipate DeCoDe to be broadly useful for a variety of library generation problems, ranging from protein engineering attempts that leverage mutual information to the reconstruction of ancestral protein states.
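To give a feel for the kind of selection problem a DC design tool has to solve, the sketch below picks a degenerate codon for a single position by exhaustive search. It is deliberately not DeCoDe's integer linear programming formulation, which the abstract does not spell out: the function `best_codon` and its tie-breaking rule are assumptions made for illustration, and brute force like this only works because one position has just 15^3 = 3375 candidate codons.

```python
from itertools import product

# IUPAC nucleotide codes: each symbol stands for a mixture of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def expand(dc):
    """Return every concrete codon represented by a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in dc))]


def encoded(dc):
    """Return the set of amino acids (or '*') a degenerate codon encodes."""
    return {CODON_TABLE[codon] for codon in expand(dc)}


def best_codon(targets):
    """Exhaustively find a degenerate codon that covers all target amino acids.

    Hypothetical ranking: fewest off-target amino acids first, then fewest
    concrete codons, then alphabetical order so the result is deterministic.
    """
    targets = set(targets)
    candidates = []
    for symbols in product(sorted(IUPAC), repeat=3):
        dc = "".join(symbols)
        aas = encoded(dc)
        if targets <= aas:  # every target amino acid must be encoded
            candidates.append((len(aas - targets), len(expand(dc)), dc))
    return min(candidates)[2]


choice = best_codon({"A", "G"})
print(choice, sorted(encoded(choice)))
```

A per-position search like this ignores covariation between positions and any limit on library size, which is exactly where a global formulation over the whole library becomes necessary.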
DeCoDe is available at github.com/OrensteinLab/DeCoDe.
Supplementary data are available at Bioinformatics online.