Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan.
Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan.
Sci Rep. 2024 Jan 4;14(1):552. doi: 10.1038/s41598-023-51095-z.
When functional biological sequences are designed with machine learning, the activity predictor tends to be inaccurate because training data are scarce. The top-ranked sequences are therefore unlikely to include effective ones. This paper proposes taking prediction stability into account so that domain experts receive a reasonable list of sequences to choose from. In our approach, multiple prediction models are trained on subsamples of the training set, and a multi-objective optimization problem is solved in which one objective is the average predicted activity across the models and the other is the standard deviation of those predictions. The Pareto front then represents a list of sequences covering the whole spectrum of activity and stability. Using this method, we designed VHH antibodies (VHH: variable domain of the heavy chain of a heavy-chain antibody) based on a dataset obtained from deep mutational screening. To solve the multi-objective optimization problem, we employed our sequence design software MOQA, which uses quantum annealing. By applying several selection criteria to 19,778 designed sequences, we selected five sequences for wet-lab validation. One sequence, 16 mutations away from the closest training sequence, was successfully expressed and found to possess the desired binding specificity. Our whole-spectrum approach provides a balanced way of dealing with prediction uncertainty and can possibly be applied to an extensive search for functional sequences.
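The subsampling and Pareto-front idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline and does not use MOQA or quantum annealing: it assumes a one-hot encoding, a closed-form ridge regressor as a stand-in activity predictor, a fixed pool of candidate sequences scanned by brute force, and the convention that higher mean activity and lower standard deviation are preferred. All function names and parameters (`one_hot`, `fit_ridge`, `ensemble_predict`, `pareto_front`, `n_models`, `frac`, `lam`) are hypothetical.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(seq):
    """Flattened one-hot encoding of an amino-acid sequence (assumed featurization)."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, ALPHABET.index(aa)] = 1.0
    return x.ravel()


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression; a stand-in for the real activity predictor."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ensemble_predict(X_train, y_train, X_cand, n_models=20, frac=0.8, seed=0):
    """Train n_models predictors on random subsamples (without replacement)
    and return the per-candidate mean and standard deviation of predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    k = max(1, int(frac * n))
    preds = []
    for _ in range(n_models):
        idx = rng.choice(n, size=k, replace=False)
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_cand @ w)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)


def pareto_front(mean, std):
    """Indices of candidates not dominated by any other candidate,
    assuming mean is maximized and std is minimized."""
    front = []
    for i in range(len(mean)):
        no_worse = (mean >= mean[i]) & (std <= std[i])
        strictly_better = (mean > mean[i]) | (std < std[i])
        if not np.any(no_worse & strictly_better):
            front.append(i)
    return front


if __name__ == "__main__":
    # Synthetic toy data only; real use would start from measured activities.
    rng = np.random.default_rng(1)
    make = lambda: "".join(rng.choice(list(ALPHABET), size=8))
    train = [make() for _ in range(60)]
    cands = [make() for _ in range(200)]
    X_train = np.array([one_hot(s) for s in train])
    y_train = rng.normal(size=len(train))
    X_cand = np.array([one_hot(s) for s in cands])
    mean, std = ensemble_predict(X_train, y_train, X_cand)
    for i in pareto_front(mean, std):
        print(cands[i], round(float(mean[i]), 3), round(float(std[i]), 3))
```

The printed candidates range from high predicted activity with high disagreement among models to lower activity with more stable predictions, which is the kind of list a domain expert could then filter further. A realistic sequence space is far too large to enumerate this way, which is why the paper relies on a dedicated optimizer.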