Jiang Xinning, Jiang Xiaogang, Han Guanghui, Ye Mingliang, Zou Hanfa
National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China.
BMC Bioinformatics. 2007 Aug 31;8:323. doi: 10.1186/1471-2105-8-323.
In proteomic analysis, MS/MS spectra acquired by a mass spectrometer are assigned to peptides by database searching algorithms such as SEQUEST. SEQUEST characterizes each peptide-spectrum assignment with several scores, including Xcorr, Delta Cn, Sp, Rsp, and matched ion count. Filtering criteria that combine several of these scores are used to separate correct identifications from random assignments. However, these filtering criteria have not yet been well optimized.
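As a minimal sketch of what such a filtering criterion looks like in practice, the Python fragment below applies two score thresholds to a list of peptide-spectrum matches (PSMs) and estimates the FDR from decoy hits. The `PSM` class, the function names, the choice of only Xcorr and Delta Cn, and the estimate FDR = 2 × decoys / accepted are all assumptions made for illustration; the abstract does not state the exact formula or the score set used.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """One peptide-spectrum match (hypothetical minimal record)."""
    xcorr: float
    delta_cn: float
    is_decoy: bool  # True if the match came from the decoy database


def passes(psm, min_xcorr, min_delta_cn):
    """A PSM is accepted only if every score meets its threshold."""
    return psm.xcorr >= min_xcorr and psm.delta_cn >= min_delta_cn


def estimate_fdr(psms, min_xcorr, min_delta_cn):
    """Return (accepted target count, estimated FDR) for one criterion.

    Assumes a combined target/decoy search, where each accepted decoy
    stands for one false target hit: FDR = 2 * decoys / accepted.
    """
    accepted = [p for p in psms if passes(p, min_xcorr, min_delta_cn)]
    if not accepted:
        return 0, 0.0
    decoys = sum(p.is_decoy for p in accepted)
    return len(accepted) - decoys, 2 * decoys / len(accepted)
```

Raising either threshold removes decoy hits and lowers the estimated FDR, but also discards target hits, which is the trade-off the optimization has to balance.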
In this study, we implemented a machine learning approach known as a predictive genetic algorithm (GA) to optimize the filtering criteria, maximizing the number of identified peptides at a fixed false-discovery rate (FDR) for SEQUEST database searching. Because the FDR was determined directly by a decoy database search scheme, the GA-based optimization did not require any prior knowledge of the characteristics of the data set, a significant advantage over statistical approaches such as PeptideProphet. Compared with PeptideProphet, the GA-based approach achieved similar performance in distinguishing true from false assignments with only 1/10 of the processing time. Moreover, because it does not rely on any assumptions about the data, the GA-based approach can easily be extended to process the results of other database search algorithms.
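The optimization idea can be illustrated with a small, plain genetic algorithm in Python. This is a sketch under stated assumptions, not the authors' software: it optimizes only two thresholds (Xcorr and Delta Cn), uses the decoy estimate FDR = 2 × decoys / accepted, does not reproduce the "predictive" component of the published GA, and runs on synthetic scores. All names, bounds, and GA parameters are hypothetical.

```python
import random

# Assumed search ranges for (min Xcorr, min Delta Cn).
BOUNDS = [(0.0, 6.0), (0.0, 0.5)]


def fitness(thresholds, psms, max_fdr):
    """Accepted target PSMs, or -1 if the decoy-estimated FDR is too high."""
    min_xcorr, min_dcn = thresholds
    targets = decoys = 0
    for xcorr, dcn, is_decoy in psms:
        if xcorr >= min_xcorr and dcn >= min_dcn:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    total = targets + decoys
    if total == 0:
        return 0
    return targets if 2 * decoys / total <= max_fdr else -1


def optimize(psms, max_fdr=0.01, pop_size=40, generations=30, seed=0):
    """Search for thresholds that maximize target hits at a fixed FDR."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in BOUNDS]
                  for _ in range(pop_size - 1)]
    population.append([hi for _, hi in BOUNDS])  # strictest criterion
    for _ in range(generations):
        scored = sorted(((fitness(ind, psms, max_fdr), ind) for ind in population),
                        key=lambda s: s[0], reverse=True)
        # Elitism: carry the two best individuals over unchanged.
        next_gen = [scored[0][1][:], scored[1][1][:]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(scored, 3), key=lambda s: s[0])[1]
            b = max(rng.sample(scored, 3), key=lambda s: s[0])[1]
            # Uniform crossover, then Gaussian mutation clipped to the bounds.
            child = [rng.choice(pair) for pair in zip(a, b)]
            for i, (lo, hi) in enumerate(BOUNDS):
                if rng.random() < 0.2:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            next_gen.append(child)
        population = next_gen
    best = max(population, key=lambda ind: fitness(ind, psms, max_fdr))
    return best, fitness(best, psms, max_fdr)


def make_synthetic_psms(n=300, seed=1):
    """Synthetic (xcorr, delta_cn, is_decoy) tuples, for illustration only."""
    rng = random.Random(seed)
    psms = []
    for _ in range(n):
        psms.append((rng.gauss(3.0, 0.8), rng.gauss(0.20, 0.08), False))  # correct target
        psms.append((rng.gauss(1.5, 0.5), rng.gauss(0.05, 0.04), False))  # incorrect target
        psms.append((rng.gauss(1.5, 0.5), rng.gauss(0.05, 0.04), True))   # decoy
    return psms


if __name__ == "__main__":
    data = make_synthetic_psms()
    thresholds, n_targets = optimize(data, max_fdr=0.01)
    print("thresholds:", thresholds, "accepted targets:", n_targets)
```

Because the fitness function only counts target and decoy hits, the same loop could in principle be pointed at scores from another search engine by changing the tuple fields and `BOUNDS`, which is in the spirit of the extensibility claim above.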
Our results indicated that filtering criteria should be optimized individually for different samples. The newly developed GA-based software provides a convenient and fast way to create tailored, optimal criteria for different proteome samples and thereby improve proteome coverage.