University of Missouri, Electrical Engineering & Computer Science, Columbia, MO, 65211, USA.
Oak Ridge National Laboratory, Oak Ridge, TN, 37830, USA.
Sci Data. 2023 Aug 3;10(1):509. doi: 10.1038/s41597-023-02409-3.
In this work, we expand on a dataset recently introduced for protein interface prediction (PIP), the Database of Interacting Protein Structures (DIPS), to present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for machine learning of protein interfaces. While the original DIPS dataset contains only the Cartesian coordinates and types of the atoms in each protein complex, DIPS-Plus contains multiple residue-level features, including surface proximities, half-sphere amino acid compositions, and new profile hidden Markov model (HMM)-based sequence features for each amino acid, providing researchers with a curated feature bank for training protein interface prediction methods. We demonstrate through rigorous benchmarks that training an existing state-of-the-art (SOTA) model for PIP on DIPS-Plus yields new SOTA results, surpassing the performance of some of the latest models trained on residue-level and atom-level encodings of protein complexes.