Department of Computer and Information Science, University of Macau, Taipa, Macau, China.
College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian, China.
BMC Bioinformatics. 2019 Nov 6;20(1):549. doi: 10.1186/s12859-019-3170-1.
Mass spectra in isotope-labeled proteomics experiments are usually acquired by liquid chromatography-mass spectrometry (LC-MS). In such experiments, the mass profiles of labeled (heavy) and unlabeled (light) peptide pairs are represented by 2D or 3D isotope clusters, which carry valuable information about the biological samples under different conditions. The core task of quality control in quantitative LC-MS experiments is to filter out low-quality peptides with questionable profiles. This task is commonly treated as a classification problem. However, previous quality control methods often ignore or mishandle the imbalance in the data. In this study, we introduce a quality control framework based on the extreme gradient boosting machine (XGBoost) that explicitly addresses the imbalanced data problem.
In the XGBoost-based framework, we apply the synthetic minority over-sampling technique (SMOTE) to rebalance the data and use the balanced data to train boosted trees as the classifier. The trained classifier is then applied to other data for peptide quality assessment. Experimental results show that the proposed framework significantly increases the reliability of peptide heavy-light ratio estimation.
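The rebalancing step can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is a pure-Python illustration, not the authors' implementation; the function names `smote` and `rebalance`, the binary-label assumption, and the parameter defaults are all hypothetical. In practice one would more likely use an existing implementation such as `imblearn.over_sampling.SMOTE` and train `xgboost.XGBClassifier` on the rebalanced data.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (illustrative sketch)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # from the base sample to the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def rebalance(features, labels, minority_label, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.
    Assumes a binary problem (hypothetical helper, not from the paper)."""
    minority = [x for x, y in zip(features, labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    n_needed = max(0, n_majority - len(minority))
    new_points = smote(minority, n_needed, k=k, seed=seed)
    return features + new_points, labels + [minority_label] * len(new_points)
```

The balanced feature matrix and label vector returned by `rebalance` would then be used to fit the boosted-tree classifier, which is subsequently applied to other data. Only the training data should be oversampled; the data used for assessment is left untouched.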
Our results indicate that this framework is a powerful method for peptide quality assessment. In feature extraction, features based on the extracted ion chromatogram (XIC) contribute to the peptide quality assessment. For the imbalanced data problem, SMOTE yields much better classification performance. Finally, XGBoost is well suited to peptide quality control. Overall, the proposed framework provides reliable results for further proteomics studies.