Cui Jian, Chen Qiang, Dong Xiaorui, Shang Kai, Qi Xin, Cui Hao
Department of Information Technology, Shengli College, China University of Petroleum Huadong, BeiEr Road #271, Dongying, Shandong, P. R. China
Department of Computer Science in College of Computer and Communication Engineering, China University of Petroleum Huadong, Western Changjiang Road #66, Huangdao District, Qingdao, Shandong, P. R. China
RSC Adv. 2019 Sep 4;9(48):27874-27882. doi: 10.1039/c9ra03789f. eCollection 2019 Sep 3.
In proteomics, it is important to detect, analyze, and quantify complex peptide components and their differences. The key step is to match the elution time peaks (LC peaks) produced by the same peptide in replicate experiments. Warping functions are widely used to correct the mean time shift among replicates. However, they cannot reduce the ambiguity between corresponding and non-corresponding peak pairs, because the time shift of each extracted-ion chromatogram (XIC) is random. In this paper, isotope distribution pattern similarity is considered in addition to the time feature. The novelty is that, unlike other feature-based methods that use an isotope feature, the algorithm relies on isotope distribution similarity rather than the usual peak profile similarity. First, a training set of peptides, including corresponding and non-corresponding peak pairs, was selected from the MS/MS results. Second, the time difference and the isotope distribution pattern similarity were generated for each peak pair. Third, Support Vector Machine (SVM) classification was applied to these two features. Finally, the accuracy and the final coverage were measured. A 10-fold cross-validation was used to test the effectiveness of the SVM learning model, and the accuracy of correct matching reached up to 97%. The coverage obtained with the learning model ranged from 75% to 91% across datasets. Thus, this matching algorithm based on time and isotope distribution pattern features can provide high accuracy and coverage for identifying corresponding peaks.
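The two features described above can be illustrated with a minimal sketch. The abstract does not say how isotope distribution similarity is computed, so cosine similarity between isotope intensity vectors is an assumption here, and the names `Peak`, `isotope_similarity`, and `pair_features` are hypothetical. In the described workflow, the resulting (time difference, similarity) pairs would be the input to the SVM classifier.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Peak:
    """Hypothetical LC peak record: apex elution time and isotope intensities."""
    rt: float              # elution time of the peak apex (e.g. in minutes)
    isotopes: List[float]  # intensities of the M, M+1, M+2, ... isotope peaks


def isotope_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two isotope intensity vectors.

    Assumption: the source does not name its similarity measure; cosine
    similarity is used here only as a plausible stand-in.
    """
    n = min(len(a), len(b))  # compare only the isotope peaks both share
    dot = sum(x * y for x, y in zip(a[:n], b[:n]))
    norm_a = math.sqrt(sum(x * x for x in a[:n]))
    norm_b = math.sqrt(sum(y * y for y in b[:n]))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero pattern carries no similarity
    return dot / (norm_a * norm_b)


def pair_features(p: Peak, q: Peak) -> Tuple[float, float]:
    """Return (absolute time difference, isotope pattern similarity)."""
    return abs(p.rt - q.rt), isotope_similarity(p.isotopes, q.isotopes)


if __name__ == "__main__":
    # Two peaks from replicate runs with proportional isotope patterns.
    run1 = Peak(rt=30.0, isotopes=[100.0, 60.0, 20.0])
    run2 = Peak(rt=30.5, isotopes=[200.0, 120.0, 40.0])
    print(pair_features(run1, run2))
```

Because cosine similarity ignores overall intensity scale, two replicate peaks with the same isotope pattern but different abundances still score close to 1, which is the property a matching feature of this kind would need.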