Abeer A N M Nafiz, Boroumand Mehdi, Sermadiras Isabelle, Caldwell Jenna G, Stanev Valentin, Mody Neil, Kaplan Gilad, Savery James, Croasdale-Wood Rebecca, Pouryahya Maryam
Data Science and Modelling, BioPharmaceuticals R&D, AstraZeneca, Gaithersburg, MD, USA.
Department of Electrical and Computer Engineering, Texas A&M University, R&D, AstraZeneca, College Station, TX, USA.
MAbs. 2025 Dec;17(1):2562997. doi: 10.1080/19420862.2025.2562997. Epub 2025 Sep 26.
Experimental screening for biopharmaceutical developability properties typically relies on resource-intensive and time-consuming assays such as size exclusion chromatography (SEC). This study highlights the potential of in silico models to accelerate the screening process by exploring sequence- and structure-based machine learning techniques. Specifically, we compared surrogate models based on pre-computed features extracted from sequence and predicted structure with sequence-based approaches using protein language models (PLMs) such as ESM-2. In addition to different end-to-end fine-tuning strategies for the PLMs, we also investigated integrating the structural information of the antibodies into the prediction pipeline through graph neural networks (GNNs). We applied these methods to predict protein aggregation propensity using a dataset of approximately 1200 immunoglobulin G1 (IgG1) molecules. Through this empirical evaluation, our study identifies the most effective in silico approach for predicting developability properties measured by SEC assays, thereby adding insights to existing screening efforts for accelerating the antibody development process.
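To make the idea of a surrogate model on pre-computed sequence features concrete, the following is a minimal sketch and not the study's pipeline. The abstract does not specify which features or models were used, so everything here is an assumption for illustration: the feature set (amino-acid composition, a hydrophobic fraction based on an arbitrary residue grouping, and a crude net charge per residue), the function names `sequence_features` and `knn_predict`, and the choice of a k-nearest-neighbour regressor as a stand-in for the surrogate model.

```python
import math
from collections import Counter
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Illustrative hydrophobic grouping; an assumption, not taken from the study.
HYDROPHOBIC = set("AVILMFWY")


def sequence_features(seq: str) -> List[float]:
    """Return 22 hand-crafted features for one protein sequence:
    20 amino-acid fractions, the hydrophobic fraction, and a crude
    net charge per residue (K and R count +1, D and E count -1)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    composition = [counts[a] / n for a in AMINO_ACIDS]
    hydrophobic_fraction = sum(counts[a] for a in HYDROPHOBIC) / n
    net_charge = (counts["K"] + counts["R"] - counts["D"] - counts["E"]) / n
    return composition + [hydrophobic_fraction, net_charge]


def knn_predict(train_x: Sequence[Sequence[float]],
                train_y: Sequence[float],
                x: Sequence[float],
                k: int = 3) -> float:
    """Predict a scalar label for x as the mean label of its k nearest
    training points under Euclidean distance."""
    if k < 1 or len(train_x) == 0:
        raise ValueError("need k >= 1 and at least one training point")
    ranked = sorted((math.dist(row, x), y) for row, y in zip(train_x, train_y))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)
```

In use, each antibody sequence would be mapped through `sequence_features`, and a measured SEC readout would serve as the label passed to `knn_predict`. A PLM-based variant would replace the hand-crafted vector with an embedding from a model such as ESM-2 while keeping the same downstream regressor interface.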