Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, United States.
MAP Bioscience, La Jolla, CA 92093, United States.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae278.
Antibody therapeutic candidates must exhibit not only tight binding to their target but also good developability properties, especially low risk of immunogenicity.
In this work, we fit a simple generative model, SAM, to sixty million human heavy chains and seventy million human light chains. We show that the probability the model assigns to a sequence distinguishes human sequences from those of other species with accuracy equal to or better than that of any other model in the literature, across a variety of benchmark datasets containing >400 million sequences, and that it outperforms large language models (LLMs) by large margins. SAM can humanize sequences, generate new sequences, and score sequences for humanness. It is both fast and fully interpretable. Our results highlight the importance of using simple models as baselines for protein engineering tasks. We additionally introduce a new tool for numbering antibody sequences that is orders of magnitude faster than existing tools in the literature.
All tools developed in this study are available at https://github.com/Wang-lab-UCSD/AntPack.
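The idea of scoring humanness by sequence probability can be illustrated with a minimal sketch. This is not SAM and does not use the AntPack API: it assumes a toy model with one independent categorical distribution per column of pre-aligned, equal-length sequences, and the names `fit_position_model` and `log_probability` are hypothetical. Sequences resembling the training set receive a higher log-probability than dissimilar ones.

```python
import math
from collections import Counter

# Assumed alphabet: the 20 standard amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def fit_position_model(aligned_seqs, pseudocount=1.0):
    """Fit one categorical distribution per alignment column (toy model).

    Returns a list with one dict per column mapping each symbol to its
    log-probability. A pseudocount keeps unseen symbols from scoring -inf.
    """
    if not aligned_seqs:
        raise ValueError("need at least one training sequence")
    length = len(aligned_seqs[0])
    for seq in aligned_seqs:
        if len(seq) != length:
            raise ValueError("all sequences must share the same aligned length")
        if any(symbol not in ALPHABET for symbol in seq):
            raise ValueError("sequence contains a symbol outside the alphabet")

    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    model = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        model.append(
            {s: math.log((counts[s] + pseudocount) / total) for s in ALPHABET}
        )
    return model


def log_probability(model, seq):
    """Sum the per-column log-probabilities of an aligned sequence."""
    if len(seq) != len(model):
        raise ValueError("sequence length does not match the model")
    return sum(column[symbol] for column, symbol in zip(model, seq))


# Illustrative, made-up training fragments (not real repertoire data).
training = ["QVQLV", "QVQLV", "EVQLV", "QVQLQ"]
model = fit_position_model(training)
score_similar = log_probability(model, "QVQLV")
score_dissimilar = log_probability(model, "WWWWW")
```

In this sketch `score_similar` exceeds `score_dissimilar`, so a threshold on the log-probability would separate the two. A real model would be fit to numbered sequences from a large repertoire rather than a handful of fragments.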