Engler Hart Chloe, Preto António José, Chanana Shaurya, Healey David, Kind Tobias, Domingo-Fernández Daniel
Enveda Biosciences, Inc., 5700 Flatiron Pkwy, Boulder, CO, 80301, USA.
J Cheminform. 2024 Aug 29;16(1):105. doi: 10.1186/s13321-024-00899-w.
Ion mobility coupled with mass spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross-section (CCS) values, which are indicative of molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art graph neural networks (GNNs) trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training data, their performance declines significantly when they are applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need to release further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS, which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and molecule types. Lastly, we show how confidence models can help by improving the reliability of the CCS estimates.
Scientific contribution
We have benchmarked state-of-the-art graph neural networks for predicting collision cross-section. Our work highlights the accuracy of these models when they are trained and evaluated in similar chemical spaces, but also how their accuracy drops when they are evaluated in structurally novel regions. Lastly, we conclude by presenting potential approaches to mitigate this issue.
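The gap between performance in familiar and structurally novel chemical space can be illustrated with a minimal sketch. It is not the evaluation protocol of the study, which the abstract does not specify. It assumes each molecule is represented by a set of fingerprint on-bits, uses Tanimoto similarity to the nearest training molecule as the novelty measure, and applies an arbitrary threshold of 0.4; the function names and toy data are hypothetical.

```python
from statistics import median


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_train_similarity(fp: frozenset, train_fps: list) -> float:
    """Similarity of one test molecule to its nearest training molecule."""
    return max((tanimoto(fp, t) for t in train_fps), default=0.0)


def error_by_novelty(test_records, train_fps, threshold=0.4):
    """Median relative CCS error (%) for 'similar' and 'novel' test molecules.

    test_records: iterable of (fingerprint, measured_ccs, predicted_ccs).
    threshold: assumed cutoff on nearest-neighbour Tanimoto similarity.
    """
    groups = {"similar": [], "novel": []}
    for fp, measured, predicted in test_records:
        rel_err = abs(predicted - measured) / measured * 100.0
        is_similar = max_train_similarity(fp, train_fps) >= threshold
        groups["similar" if is_similar else "novel"].append(rel_err)
    # Groups with no molecules report None rather than raising an error.
    return {name: (median(errs) if errs else None) for name, errs in groups.items()}


# Toy data, made up purely for illustration (not from the study).
train_fps = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
test_records = [
    (frozenset({1, 2, 3}), 200.0, 202.0),  # shares most bits with a training molecule
    (frozenset({7, 8, 9}), 150.0, 165.0),  # shares no bits with any training molecule
]
report = error_by_novelty(test_records, train_fps)
print(report)
```

Reporting the error separately for each group, rather than as one pooled number, is what exposes a drop in accuracy on novel structures; a pooled metric dominated by training-like molecules would hide it.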