Narduzzi Luca, Stanstrup Jan, Mattivi Fulvio, Franceschi Pietro
a Research and Innovation Centre, Fondazione Edmund Mach (FEM), San Michele all'Adige, Italy.
b Department of Plant and Environmental Sciences, Faculty of Science, University of Copenhagen, Copenhagen, Denmark.
Food Addit Contam Part A Chem Anal Control Expo Risk Assess. 2018 Nov;35(11):2145-2157. doi: 10.1080/19440049.2018.1523572. Epub 2018 Oct 23.
Compound identification is the main hurdle in LC-HRMS-based metabolomics, given the high number of 'unknown' metabolites. In recent years, numerous in silico fragmentation simulators have been developed to simplify and improve mass spectral interpretation and compound annotation. Nevertheless, expert mass spectrometry users and chemists are still needed to select the correct entry from the many candidates proposed by automatic tools, especially in the plant kingdom, where natural compounds show huge structural diversity. In this work, we propose a supervised machine learning approach that predicts molecular substructures from isotopic patterns, with the model trained on a large database of grape metabolites. This approach, called 'Compounds Characteristics Comparison' (CCC), emulates a plant chemist who 'gains experience' from a (proof-of-principle) dataset of grape compounds. The results show that the CCC approach predicts most of the proposed substructures with good accuracy. In addition, when MS/MS spectra from real data were queried in Metfrag 2.2 and the CCC predictions were applied as scoring terms, the CCC approach gave the correct candidates a better ranking, improving users' confidence in candidate selection. Our results demonstrate that the proposed approach can complement current identification strategies based on fragmentation simulators and formula calculators, assisting compound identification. The CCC algorithm is freely available as an R package (https://github.com/lucanard/CCC), which includes seamless integration with Metfrag. The CCC package also permits uploading additional training data, which can be used to extend the proposed approach to other systems and biological matrices.
List of abbreviations: Acidic: acidic moiety; aliph: aliphatic chain; AUC: area under the ROC curve; bs: best glycosidic structure; CCC: Compounds' Characteristics Comparison; Cees: Carbons estimation errors; CO: Carbon to Oxygen ratio; Het: Heterocyclic moiety; IMD: Isotopic Mass Defect (and Pattern); LC-HRMS: Liquid Chromatography - High Resolution Mass Spectrometry; md: mass defect; MM: Monoisotopic Mass; MS: Mass Spectrometry; MSE: Mean Squared Error; nC: number of Carbons; NN: Nitrogen; pC: percentage of Carbon mass on the total mass; Pho: Phosphate; PLSr: Partial Least Squares regression; ppm: parts per million; QSRR: Quantitative structure-retention relationship; RMD: Relative Mass Defect; ROC: Receiver Operating Characteristic; rRMD: residual Relative Mass Defect; RT: retention time; Sul: Sulphur; UPLC-ESI-Q-TOF-MS: Ultra Performance Liquid Chromatography - ElectroSpray Ionization - Quadrupole - Time of Flight - Mass Spectrometry; VAT: Vitis arizonica Texas.
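Several of the abbreviated quantities (md, RMD, nC) can be derived directly from a monoisotopic mass and an isotopic pattern. The following Python sketch illustrates commonly used textbook definitions of these quantities; it is not taken from the CCC package, and the function names, the rounding-based nominal mass, and the carbon-only treatment of the M+1 peak are all assumptions made for illustration.

```python
# Illustrative sketch only: common definitions of mass defect, relative mass
# defect and a carbon-count estimate. These are NOT the CCC package's functions.

# Natural abundance ratio 13C/12C (about 1.07 % 13C); assumed constant here.
C13_TO_C12 = 0.0107 / 0.9893


def mass_defect(monoisotopic_mass: float) -> float:
    """Mass defect (md): monoisotopic mass minus nominal mass.

    Assumption: the nominal mass is approximated by rounding to the nearest
    integer, which can fail for heavy ions whose defect exceeds 0.5 Da.
    """
    return monoisotopic_mass - round(monoisotopic_mass)


def relative_mass_defect_ppm(monoisotopic_mass: float) -> float:
    """Relative mass defect (RMD) in ppm: md scaled by the monoisotopic mass."""
    return mass_defect(monoisotopic_mass) / monoisotopic_mass * 1e6


def estimate_carbons(m1_over_m0: float) -> float:
    """Rough number of carbons (nC) from the M+1 / M intensity ratio.

    Assumption: the M+1 peak is attributed to 13C only, ignoring the smaller
    contributions of 2H, 15N, 17O and 33S, so the value is an overestimate.
    """
    return m1_over_m0 / C13_TO_C12


if __name__ == "__main__":
    glucose_mm = 180.06339  # monoisotopic mass of C6H12O6
    print(mass_defect(glucose_mm))
    print(relative_mass_defect_ppm(glucose_mm))
    print(estimate_carbons(0.065))  # hypothetical measured M+1/M ratio
```

Features of this kind are cheap to compute for every detected ion, which is one reason isotopic patterns are attractive inputs for a supervised model; the actual feature set and the model used by CCC should be checked against the package itself.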