Zhang Yang, Vitalis Andreas
Department of Biochemistry, University of Zurich, 8057 Zurich, Switzerland.
Patterns (N Y). 2025 Jan 10;6(1):101147. doi: 10.1016/j.patter.2024.101147.
True three-dimensional (3D) data are prevalent in domains such as molecular science or computer vision. In these data, machine learning models are often asked to identify objects that are intrinsically flexible. Our study introduces two datasets from molecular science to assess the classification robustness of common model/feature combinations. Molecules are flexible, and shapes alone exhibit intra-class heterogeneity that creates a high risk of confusion. By blocking training and test sets to reduce overlap, we establish a baseline that requires the trained models to abstract from shape. As training data coverage grows, all tested architectures perform better on unseen data and overfit less. Empirically, 2D embeddings of voxelized data produced the best-performing models. Evidently, both featurization and task-appropriate model design remain important, the latter point being reinforced by comparisons to recent, more specialized models. Finally, we show that the shape abstraction learned from database samples extends to samples that evolve explicitly in time.
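As an illustration of two steps named in the abstract, blocking the training and test sets and voxelizing 3D data, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the function names `blocked_split` and `voxelize`, the per-sample group labels, and the grid parameters (`origin`, `voxel_size`, `grid_shape`) are all hypothetical and chosen only for this example. The split is assumed to work at the level of whole groups, so that no group contributes samples to both sets.

```python
import random
from collections import defaultdict


def blocked_split(group_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no group appears in both train and test.

    group_ids: one sortable, hashable label per sample (hypothetical,
    e.g. an identifier of the entry a sample was derived from).
    Returns (train_indices, test_indices).
    """
    by_group = defaultdict(list)
    for index, group in enumerate(group_ids):
        by_group[group].append(index)

    groups = sorted(by_group)  # fixed order before the seeded shuffle
    random.Random(seed).shuffle(groups)

    # Whole groups, not individual samples, are assigned to the test set.
    n_test_groups = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test_groups])

    train, test = [], []
    for group, indices in by_group.items():
        (test if group in test_groups else train).extend(indices)
    return sorted(train), sorted(test)


def voxelize(points, origin, voxel_size, grid_shape):
    """Map 3D points to a binary occupancy grid indexed as grid[x][y][z].

    Points that fall outside the grid are ignored.
    """
    nx, ny, nz = grid_shape
    grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for point in points:
        idx = [int((c - o) // voxel_size) for c, o in zip(point, origin)]
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


# Usage with made-up labels and coordinates (illustrative only).
groups = ["A", "A", "B", "B", "C", "C", "D", "D"]
train_idx, test_idx = blocked_split(groups)
grid = voxelize([(0.5, 0.5, 0.5), (2.5, 0.1, 1.9)], (0.0, 0.0, 0.0), 1.0, (3, 3, 3))
```

Splitting by group rather than by sample is what reduces overlap between the two sets: samples that share a label can be near-duplicates, and a random per-sample split would let them leak across the boundary.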