AbSet: A Standardized Data Set of Antibody Structures for Machine Learning Applications.

Author information

Almeida Diego S, Almeida Matheus V, Sampaio Jean V, Gaieta Eduardo M, Costa Andrielly H S, Rabelo Francisco F A, Cavalcante César L, Sartori Geraldo R, Silva João H M

Affiliations

Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.

Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Rio de Janeiro 21040-900, Brazil.

Publication information

J Chem Inf Model. 2025 May 26;65(10):4767-4774. doi: 10.1021/acs.jcim.5c00410. Epub 2025 May 11.

DOI:10.1021/acs.jcim.5c00410

PMID:40349368

Abstract

Machine learning (ML) algorithms have played a fundamental role in the development of therapeutic antibodies by being trained on data sets of sequences and/or structures. However, structural data sets remain limited, especially those that include antibody-antigen complexes. Additionally, many of the available structures are not standardized, and antibody-specific databases often do not provide molecular descriptors that could enhance ML models. To address this gap, we introduce AbSet, a curated data set comprising over 800,000 antibody structures and corresponding molecular descriptors, including both experimentally determined and in silico-generated antibody-antigen complexes. We systematically retrieved antibody structures from the Protein Data Bank (PDB), applied rigorous standardization protocols, and expanded the data set through large-scale protein-protein docking to generate structural variants of antibody-antigen interactions. Each model was classified as high, medium, acceptable, or incorrect quality based on structural similarity to reference experimental complexes. This classification enables both the construction of a decoy set of confirmed non-binders and the generation of high-confidence augmented structural data for ML applications. AbSet is publicly available via the Zenodo repository, with accompanying scripts hosted on GitHub (https://github.com/SFBBGroup/AbSet.git).
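
The four-way quality classification described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity metric or cutoffs were used, so the example below assumes a DockQ-style score in [0, 1] with the commonly used cutoffs of 0.23, 0.49, and 0.80. The names `classify_model` and `split_models` are hypothetical and are not taken from the AbSet scripts.

```python
from collections import defaultdict
from typing import Dict, List

# ASSUMPTION: DockQ-style cutoffs; the AbSet abstract does not state the
# metric or thresholds actually used for its quality classes.
CUTOFFS = [(0.80, "high"), (0.49, "medium"), (0.23, "acceptable")]


def classify_model(score: float) -> str:
    """Map a similarity score in [0, 1] to one of four quality classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    for cutoff, label in CUTOFFS:
        if score >= cutoff:
            return label
    return "incorrect"


def split_models(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group hypothetical model IDs by quality class.

    Models labelled "incorrect" could serve as decoys, while the other
    classes could serve as augmented training structures.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for model_id, score in scores.items():
        groups[classify_model(score)].append(model_id)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical scores for three docked poses of one complex.
    example = {"pose_001": 0.91, "pose_002": 0.35, "pose_003": 0.05}
    print(split_models(example))
```

Under these assumptions, the "incorrect" group would feed a decoy set and the remaining groups would feed the augmented structural data; any real use should take the metric and cutoffs from the paper or its repository.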
