Kenlay Henry, Dreyer Frédéric A, Kovaltsuk Aleksandr, Miketa Dom, Pires Douglas, Deane Charlotte M
Exscientia, Oxford Science Park, Oxford, United Kingdom.
Department of Statistics, University of Oxford, Oxford, United Kingdom.
PLoS Comput Biol. 2024 Dec 6;20(12):e1012646. doi: 10.1371/journal.pcbi.1012646. eCollection 2024 Dec.
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best-performing antibody-specific language models developed to date, which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively on the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large-scale data sets and high-performance computing to enhance antibody design for therapeutic development.