Department of Genome Sciences, University of Washington, Seattle, WA, USA.
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
Sci Data. 2024 Nov 8;11(1):1207. doi: 10.1038/s41597-024-04068-4.
Training machine learning models for tasks such as de novo sequencing or spectral clustering requires large collections of confidently identified spectra. Here we describe a dataset of 2.8 million high-confidence peptide-spectrum matches derived from nine different species. The dataset is based on a previously described benchmark but has been re-processed to ensure consistent data quality and enforce separation of training and test peptides.
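As an illustration of what separating training and test peptides can mean in practice, the following is a minimal sketch, not the procedure used to build this dataset. It assumes each peptide-spectrum match is a `(spectrum_id, peptide)` pair and assigns every peptide sequence to exactly one split by hashing it, so that all spectra of a given peptide land on the same side. The function names `peptide_bucket` and `split_psms` and the `test_fraction` parameter are hypothetical.

```python
import hashlib


def peptide_bucket(peptide: str, test_fraction: float = 0.1) -> str:
    """Map a peptide sequence to 'train' or 'test' deterministically."""
    # Hash the sequence and scale the first 8 bytes into [0, 1).
    digest = hashlib.sha256(peptide.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if value < test_fraction else "train"


def split_psms(psms, test_fraction: float = 0.1):
    """Split (spectrum_id, peptide) pairs so no peptide is in both sets."""
    split = {"train": [], "test": []}
    for spectrum_id, peptide in psms:
        # The bucket depends only on the peptide, never on the spectrum.
        bucket = peptide_bucket(peptide, test_fraction)
        split[bucket].append((spectrum_id, peptide))
    return split


if __name__ == "__main__":
    example = [("s1", "PEPTIDEK"), ("s2", "PEPTIDEK"), ("s3", "LESLIEK")]
    result = split_psms(example, test_fraction=0.5)
    train_peptides = {pep for _, pep in result["train"]}
    test_peptides = {pep for _, pep in result["test"]}
    print(train_peptides.isdisjoint(test_peptides))
```

Because the assignment depends only on the peptide string, the split is reproducible across runs and the two peptide sets are disjoint by construction. How modified forms or isobaric residues of a peptide should be treated is a separate design decision that this sketch does not address.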