Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455.
Proc Natl Acad Sci U S A. 2021 Jun 8;118(23). doi: 10.1073/pnas.2026658118.
Proteins require high developability (quantified by expression, solubility, and stability) for robust utility as therapeutics, diagnostics, and in other biotechnological applications. Measuring traditional developability metrics is inherently low throughput, which often slows the development pipeline. We evaluated the ability of 10 variations of three high-throughput developability assays to predict the bacterial recombinant expression of paratope variants of the protein scaffold Gp2. Enabled by a phenotype/genotype linkage, assay performance for 10^5 variants was calculated via deep sequencing of populations sorted by proxied developability. We identified the most informative assay combination via cross-validation accuracy and correlation feature selection, and demonstrated the ability of machine learning models to exploit nonlinear mutual information to increase the assays' predictive utility. We trained a random forest model that predicts expression from assay performance; it is 35% closer to the experimental variance and trains 80% more efficiently than a model predicting from sequence information alone. Utilizing the predicted expression, we performed a site-wise analysis and predicted mutations consistent with enhanced developability. The validated assays offer the ability to identify developable proteins at unprecedented scales, reducing the bottleneck of protein commercialization.
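The assay-selection step can be illustrated with a minimal sketch. It is not the authors' code: it assumes that "correlation feature selection" refers to the standard CFS merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), which rewards assays that correlate with expression and penalizes assays that are redundant with each other. The data are synthetic, and the assay names (`assay_A`, `assay_A_dup`, `assay_B`, `assay_noise`) and the greedy forward search are hypothetical choices made only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(features, target, subset):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    # Mean absolute feature-target correlation.
    r_cf = sum(abs(pearson(features[name], target)) for name in subset) / k
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation over all pairs in the subset.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(features, target):
    """Forward selection: add the assay that most improves merit; stop when none does."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        score, name = max(
            (cfs_merit(features, target, selected + [n]), n) for n in remaining
        )
        if score <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected, best


# Synthetic stand-in data (assumption): expression depends on two latent traits.
rng = random.Random(0)
n_variants = 500
trait_a = [rng.gauss(0, 1) for _ in range(n_variants)]
trait_b = [rng.gauss(0, 1) for _ in range(n_variants)]
expression = [a + b + 0.3 * rng.gauss(0, 1) for a, b in zip(trait_a, trait_b)]

assay_a = [a + 0.3 * rng.gauss(0, 1) for a in trait_a]
assays = {
    "assay_A": assay_a,
    "assay_A_dup": [v + 0.05 * rng.gauss(0, 1) for v in assay_a],  # redundant with assay_A
    "assay_B": [b + 0.3 * rng.gauss(0, 1) for b in trait_b],
    "assay_noise": [rng.gauss(0, 1) for _ in range(n_variants)],  # uninformative
}

selected, merit = greedy_cfs(assays, expression)
print("selected assays:", selected, "merit:", round(merit, 3))
```

In this toy setting the search keeps one assay per latent trait and leaves out both the redundant replicate and the uninformative assay, which is the behavior that makes a merit of this form useful for choosing a compact assay combination. CFS captures only linear relationships, so it complements rather than replaces the nonlinear random forest model described above.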