School of Software and Microelectronics, Peking University, 24 Jinyuan Road, Daxing District, Beijing 102600, China.
National Institutes for Food and Drug Control, Beijing 100050, China.
Molecules. 2019 Dec 15;24(24):4590. doi: 10.3390/molecules24244590.
Analyzing mixtures can provide more information than analyzing individual components, and detecting the different compounds in real, complex samples is important. However, impurities and noise in mixture data often reduce accuracy, and purification and denoising are computationally expensive. In this paper, we propose a model based on a convolutional neural network (CNN) that analyzes the chemical peak information in tandem mass spectrometry (MS/MS) data. Compared with traditional analysis methods, the CNN requires fewer data preprocessing steps. The model extracts features of different compounds and classifies multi-label mass spectral data. On MS data of mixtures based on the Human Metabolome Database (HMDB), accuracy reaches 98%. Of 600 MS test data, 451 were fully detected (true positive), 142 were partially found (false positive), and 7 were falsely predicted (true negative). In comparison, for support vector machine (SVM) with principal component analysis (PCA), deep neural network (DNN), long short-term memory (LSTM), and XGBoost, the numbers of true positive test data are 282, 293, 270, and 402, respectively; the numbers of false positive test data are 318, 284, 198, and 168; and the numbers of true negative test data are 0, 23, 132, and 30. Compared with models proposed in other literature, the accuracy and performance of the CNN improved considerably by feeding the independent MS/MS data of the different compounds through a three-channel input architecture. In the future, training on MS data from different instruments and adding more offset MS data should make CNN models more general.
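The three outcome categories reported above (fully detected, partially found, falsely predicted) can be illustrated with a minimal Python sketch. The abstract does not define these categories, so the rules used here are an assumption: a prediction counts as "full" when the predicted label set equals the true set, "partial" when the two sets overlap without matching, and "wrong" when they share no compound. The function names and the compound labels "A", "B", and "C" are hypothetical and are not taken from the paper.

```python
from collections import Counter


def categorize(true_labels, predicted_labels):
    """Classify one multi-label prediction (assumed rules, not the paper's)."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    if pred_set == true_set:
        return "full"      # every compound found, no extras
    if true_set & pred_set:
        return "partial"   # at least one compound found, but not an exact match
    return "wrong"         # no true compound was predicted


def tally(samples):
    """Count outcomes over (true_labels, predicted_labels) pairs."""
    return Counter(categorize(t, p) for t, p in samples)


# Hypothetical mixtures: each pair is (true compounds, predicted compounds).
samples = [
    ({"A", "B"}, {"A", "B"}),  # full
    ({"A", "B"}, {"A"}),       # partial: one compound missed
    ({"A", "B"}, {"A", "C"}),  # partial: one correct, one spurious
    ({"A"}, {"C"}),            # wrong
]
counts = tally(samples)
print(counts["full"], counts["partial"], counts["wrong"])
```

Under these assumed rules, the three counts for a model always sum to the number of test samples, which matches the way the reported CNN counts 451, 142, and 7 add up to 600.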