Uncertainty Quantification and Flagging of Unreliable Predictions in Predicting Mass Spectrometry-Related Properties of Small Molecules Using Machine Learning.

Authors

Matyushin Dmitriy D, Burov Ivan A, Sholokhova Anastasia Yu

Affiliation

A.N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31 Leninsky Prospect, GSP-1, 119071 Moscow, Russia.

Publication

Int J Mol Sci. 2024 Dec 5;25(23):13077. doi: 10.3390/ijms252313077.

Abstract

Mass spectral identification (in particular, in metabolomics) can be refined by comparing the observed and predicted properties of molecules, such as chromatographic retention. Significant advances have been made in predicting these values using machine learning and deep learning. Usually, however, model predictions either carry no indication of the possible error (uncertainty), or only a single criterion is used for this purpose. Two values that allow the uncertainty to be estimated are the spread of predictions across the models in an ensemble and the molecular similarity between the considered molecule and the most "similar" molecule in the training set. Molecular similarity can be assessed using the Euclidean distance between vectors of real-valued molecular descriptors. A third factor indicating uncertainty is the molecule's membership in one of the clusters obtained by clustering the data set. Together, all three factors can be used as features for an uncertainty assessment model. Classification models were obtained that predict whether a prediction belongs to the worst 15%. The area under the receiver operating characteristic curve is in the range of 0.73-0.82 for the considered tasks: the prediction of retention indices in gas chromatography, retention times in liquid chromatography, and collision cross-sections in ion mobility spectrometry.
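
The three uncertainty features named in the abstract, and the "worst 15%" labelling, can be illustrated with a minimal standard-library Python sketch. This is not the authors' implementation. All function names are hypothetical. The use of the standard deviation as the "spread", nearest-centroid cluster assignment, and ranking by absolute error to define "worst" are assumptions made for illustration. The sketch also does not train a classifier: it only builds the features and labels, and includes a rank-based AUC helper for scoring any candidate uncertainty score.

```python
import math
import statistics


def ensemble_spread(member_predictions):
    """Spread of the ensemble members' predictions for one molecule.

    Assumption: spread is measured as the population standard deviation.
    """
    return statistics.pstdev(member_predictions)


def nearest_training_distance(descriptor, training_descriptors):
    """Euclidean distance to the most similar (closest) training molecule."""
    return min(math.dist(descriptor, t) for t in training_descriptors)


def nearest_cluster(descriptor, centroids):
    """Index of the closest cluster centroid.

    Assumption: cluster membership is assigned by nearest centroid.
    """
    distances = [math.dist(descriptor, c) for c in centroids]
    return distances.index(min(distances))


def flag_worst_fraction(abs_errors, fraction=0.15):
    """Label 1 for predictions whose absolute error is in the worst fraction."""
    n_flagged = max(1, round(fraction * len(abs_errors)))
    threshold = sorted(abs_errors, reverse=True)[n_flagged - 1]
    return [1 if e >= threshold else 0 for e in abs_errors]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def uncertainty_features(member_predictions, descriptor,
                         training_descriptors, centroids):
    """Feature vector for one molecule: (spread, distance, cluster index)."""
    return (
        ensemble_spread(member_predictions),
        nearest_training_distance(descriptor, training_descriptors),
        nearest_cluster(descriptor, centroids),
    )


if __name__ == "__main__":
    # Toy, made-up numbers purely to show the call pattern.
    train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    centroids = [(0.5, 0.5), (5.0, 5.0)]
    print(uncertainty_features([1200.0, 1210.0, 1195.0], (4.0, 4.5),
                               train, centroids))
```

In a fuller pipeline, feature vectors like these would be passed to a classifier trained on the labels from `flag_worst_fraction`, and `roc_auc` would be evaluated on held-out molecules. The cluster index is categorical and would typically be one-hot encoded first.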

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ea8/11641629/dbb08b445832/ijms-25-13077-g001.jpg
