Almeida Diego S, Almeida Matheus V, Sampaio Jean V, Gaieta Eduardo M, Costa Andrielly H S, Rabelo Francisco F A, Cavalcante César L, Sartori Geraldo R, Silva João H M
Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil.
J Chem Inf Model. 2025 May 26;65(10):4767-4774. doi: 10.1021/acs.jcim.5c00410. Epub 2025 May 11.
Machine learning (ML) algorithms trained on datasets of sequences and/or structures have played a fundamental role in the development of therapeutic antibodies. However, structural datasets remain limited, especially those that include antibody-antigen complexes. Additionally, many of the available structures are not standardized, and antibody-specific databases often do not provide molecular descriptors that could enhance ML models. To address this gap, we introduce AbSet, a curated dataset comprising over 800,000 antibody structures and corresponding molecular descriptors, including both experimentally determined and in silico-generated antibody-antigen complexes. We systematically retrieved antibody structures from the Protein Data Bank (PDB), applied rigorous standardization protocols, and expanded the dataset through large-scale protein-protein docking to generate structural variants of antibody-antigen interactions. Each docked model was classified as high, medium, acceptable, or incorrect quality based on its structural similarity to the reference experimental complex. This classification enables both the construction of a decoy set of confirmed non-binders and the generation of high-confidence augmented structural data for ML applications. AbSet is publicly available via the Zenodo repository, with accompanying scripts hosted on GitHub (https://github.com/SFBBGroup/AbSet.git).
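The four-way quality classification described above can be illustrated with a minimal sketch. The abstract does not state which similarity metric or cutoffs AbSet uses, so this example assumes a DockQ-style score in [0, 1] with the commonly used DockQ cutoffs (0.23, 0.49, 0.80). The function names `classify_pose` and `group_by_quality` and the pose identifiers are hypothetical and are not taken from the AbSet scripts.

```python
from collections import defaultdict

# ASSUMPTION: cutoffs follow the widely used DockQ convention;
# the abstract does not say which metric or thresholds AbSet applies.
HIGH_CUTOFF = 0.80
MEDIUM_CUTOFF = 0.49
ACCEPTABLE_CUTOFF = 0.23


def classify_pose(score: float) -> str:
    """Map a similarity score in [0, 1] to one of the four quality classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= HIGH_CUTOFF:
        return "high"
    if score >= MEDIUM_CUTOFF:
        return "medium"
    if score >= ACCEPTABLE_CUTOFF:
        return "acceptable"
    return "incorrect"


def group_by_quality(scores: dict) -> dict:
    """Group pose identifiers by quality class (hypothetical helper)."""
    groups = defaultdict(list)
    for pose_id, score in scores.items():
        groups[classify_pose(score)].append(pose_id)
    return dict(groups)


# Hypothetical docking poses with made-up scores, for illustration only.
example_scores = {
    "pose_001": 0.91,
    "pose_002": 0.55,
    "pose_003": 0.30,
    "pose_004": 0.05,
}
groups = group_by_quality(example_scores)
decoys = groups.get("incorrect", [])  # candidate decoy set
print(groups)
print("decoys:", decoys)
```

Under these assumptions, poses in the "incorrect" group would be candidates for a decoy set, while the higher-quality groups could serve as augmented training data. Which classes count as high-confidence is a choice left to the user of the dataset.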