Raouraoua Nessim, Lensink Marc F, Brysbaert Guillaume
Univ. Lille, CNRS UMR 8576-UGSF-Unité de Glycobiologie Structurale et Fonctionnelle, Lille, France.
Proteins. 2025 Aug 28. doi: 10.1002/prot.70040.
Massive sampling with AlphaFold2 has become a widely used approach in protein structure prediction. Here we present the MassiveFold CASP16-CAPRI dataset, a systematic, large-scale sampling of both monomeric and multimeric protein targets. By exploiting maximal parallelization, we produced up to 8040 models per target and shared them with the community for collaborative selection and scoring. This collective effort minimizes redundant computation and environmental impact, while granting resource-limited groups, especially those focused on scoring, access to high-quality structures. In our analysis, we define an interface-difficulty classification based on DockQ metrics, showing that massive sampling yields the greatest gains on most of the challenging interfaces. Crucially, this classification can be predicted from the median ipTM scores of a routine AF2 run, enabling users to deploy massive sampling selectively, only when it is most needed. Combined with a reduction of the massive sampling from 8040 to 2475 predictions per target, such targeted strategies dramatically cut computation time and resource use with minimal loss of accuracy. Finally, we underscore the persistent challenge of choosing optimal models from massive sampling datasets, emphasizing the need for more robust scoring methods. The MassiveFold datasets, together with AlphaFold ranking scores and CASP and CAPRI assessment metrics, are publicly available at https://github.com/GBLille/CASP16-CAPRI_MassiveFold_Data to accelerate further progress in protein structure prediction and assembly modeling.
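As a rough illustration of the targeted strategy described in the abstract, the following Python sketch decides whether to launch massive sampling from the median ipTM of a routine AF2 run. The function name, the 0.6 cutoff, and the rule that a low median ipTM signals a difficult interface are assumptions made for this example; the abstract does not state the actual thresholds or the exact decision rule.

```python
from statistics import median


def needs_massive_sampling(iptm_scores, cutoff=0.6):
    """Return True if a routine AF2 run suggests a difficult interface.

    iptm_scores: ipTM values of the models from one routine AF2 run.
    cutoff: HYPOTHETICAL threshold on the median ipTM (not from the paper).
    """
    if not iptm_scores:
        raise ValueError("at least one ipTM score is required")
    # Assumption: a low median ipTM indicates a challenging interface,
    # which is where massive sampling is reported to help most.
    return median(iptm_scores) < cutoff


# Illustrative values only, not taken from the dataset.
routine_run = [0.21, 0.35, 0.30, 0.42, 0.28]
if needs_massive_sampling(routine_run):
    print("Median ipTM is low: consider massive sampling for this target.")
else:
    print("Median ipTM is high: a routine run may be sufficient.")
```

In practice the cutoff would have to be calibrated against the interface-difficulty classes defined in the paper rather than chosen by hand.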