Sirota Fernanda L, Maurer-Stroh Sebastian, Li Zhi, Eisenhaber Frank, Eisenhaber Birgit
Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore.
Department of Biological Sciences, National University of Singapore, Singapore, Singapore.
Front Bioeng Biotechnol. 2021 Aug 2;9:701120. doi: 10.3389/fbioe.2021.701120. eCollection 2021.
Large enzyme families such as the zinc-dependent alcohol dehydrogenases (ADHs), long-chain alcohol oxidases (AOxs) or amine dehydrogenases (AmDHs), which sometimes have more than one million sequences in the non-redundant protein database and hundreds of experimentally characterized enzymes, are excellent cases for protein engineering efforts aimed at refining and modifying substrate specificity. Yet the downside of this wealth of information is that it becomes technically difficult to rationally select optimal sequence targets as well as sequence positions for mutagenesis studies. In all three cases, we approach the problem by starting with a group of experimentally well-studied family members (including those with available 3D structures) and creating a structure-guided multiple sequence alignment and a modified phylogenetic tree (also known as a binding site tree). This tree is based only on a selection of potential substrate-binding residue positions derived from experimental information, not on the full-length sequence alignment. The remaining, mostly uncharacterized enzyme sequences can then be mapped onto the tree; as a trend, the grouping of sequences in the tree branches follows substrate specificity. We show that this information can be used in target selection for protein engineering work to narrow the choice down to single suitable sequences and just a few relevant candidate positions for directed evolution towards activity on desired organic compound substrates. We also demonstrate how to find the closest thermophile example in the dataset if the engineering is aimed at achieving the most robust enzymes.
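The core idea of comparing sequences only at selected binding-site positions can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy alignment, the column indices, the sequence names and the simple mismatch-fraction distance are all hypothetical and chosen for illustration. A real analysis would start from a structure-guided alignment and build a tree from the resulting distances with a phylogenetic method.

```python
# Hypothetical toy alignment: every row is an aligned sequence of equal length.
alignment = {
    "ADH_A": "MKGHCAVLTE",   # characterized (assumed)
    "ADH_B": "MKGHCSVLTE",   # characterized (assumed)
    "AOx_C": "MRGWCAFLSD",   # characterized (assumed)
    "query": "MKAHCSVLSE",   # uncharacterized sequence to be mapped
}

# Assumed 0-based alignment columns of potential substrate-binding residues.
binding_site_columns = [3, 5, 6, 8]
characterized = ["ADH_A", "ADH_B", "AOx_C"]


def signature(aligned_seq, columns):
    """Return the residues found at the selected alignment columns."""
    return "".join(aligned_seq[c] for c in columns)


def site_distance(sig_a, sig_b):
    """Fraction of binding-site positions at which two signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have equal length")
    mismatches = sum(1 for a, b in zip(sig_a, sig_b) if a != b)
    return mismatches / len(sig_a)


def nearest_characterized(query_name, alignment, columns, references):
    """Map a sequence to the reference with the most similar binding site."""
    query_sig = signature(alignment[query_name], columns)
    scored = (
        (name, site_distance(query_sig, signature(alignment[name], columns)))
        for name in references
    )
    return min(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    name, dist = nearest_characterized(
        "query", alignment, binding_site_columns, characterized
    )
    print(f"closest characterized member: {name} (distance {dist:.2f})")
```

Restricting the comparison to a handful of columns means that two sequences with low overall identity can still land next to each other if their binding sites match, which is the property the binding site tree relies on. The positions at which the query differs from its nearest characterized neighbor are natural first candidates for mutagenesis.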