A cross-validation scheme for machine learning algorithms in shotgun proteomics.

Affiliation

Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden.

Publication information

BMC Bioinformatics. 2012;13(Suppl 16):S3. doi: 10.1186/1471-2105-13-S16-S3. Epub 2012 Nov 5.

Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
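
The two ideas in the abstract, decoy-based error estimation and cross-validation of a learned scoring function, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`decoy_fdr`, `split_folds`, `train_centroid_scorer`, `cross_validated_scores`), the three-fold default, and the toy centroid-difference scorer are all assumptions made for illustration, and the scorer is a simple supervised stand-in for the semi-supervised learners the abstract refers to. The point it shows is that every match is scored by a model that never saw that match during training, so the decoy-based error estimate is not inflated by overfitting.

```python
import random


def decoy_fdr(scores, is_decoy, threshold):
    """Estimate the false discovery rate among target matches scoring at or
    above `threshold` as (#decoys accepted) / (#targets accepted)."""
    targets = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and d)
    return decoys / targets if targets else 0.0


def split_folds(n_items, n_folds=3, seed=0):
    """Randomly partition the indices 0..n_items-1 into n_folds disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def train_centroid_scorer(features, is_decoy):
    """Toy learner (an assumption, not the paper's algorithm): score a match by
    projecting its features onto (mean target vector - mean decoy vector).
    Requires at least one target and one decoy in the training data."""
    dim = len(features[0])

    def mean(rows):
        return [sum(row[j] for row in rows) / len(rows) for j in range(dim)]

    target_mean = mean([f for f, d in zip(features, is_decoy) if not d])
    decoy_mean = mean([f for f, d in zip(features, is_decoy) if d])
    weights = [t - d for t, d in zip(target_mean, decoy_mean)]
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


def cross_validated_scores(features, is_decoy, train, n_folds=3, seed=0):
    """Score every match with a model trained only on the other folds."""
    folds = split_folds(len(features), n_folds, seed)
    scores = [None] * len(features)
    for k, test_idx in enumerate(folds):
        # Training data: all folds except the one being scored.
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = train([features[i] for i in train_idx],
                      [is_decoy[i] for i in train_idx])
        for i in test_idx:
            scores[i] = model(features[i])
    return scores
```

Calling `cross_validated_scores(features, is_decoy, train_centroid_scorer)` and then `decoy_fdr(scores, is_decoy, threshold)` gives an error estimate based on held-out scores only. One caveat of this sketch is that scores produced by different per-fold models are merged without any calibration; a real pipeline would likely need to make them comparable before applying a single threshold.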
