Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark.
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Sci Data. 2024 Jan 23;11(1):112. doi: 10.1038/s41597-024-02922-z.
Here we provide a curated, large-scale, label-free mass spectrometry-based proteomics dataset derived from HeLa cell lines for general-purpose machine learning and analysis. Accessing and filtering data are tedious tasks that take up a considerable amount of researchers' time. We therefore provide machine-generated metadata for easy selection and overview across the 7,444 raw files and the MaxQuant search output. For convenience, we provide three filtered and aggregated development datasets at the protein group, peptide, and precursor levels. In addition to easily accessible training data, we provide an SDRF file that annotates each raw file with its instrument settings, allowing automated reprocessing. We also provide our workflows and analysis scripts, and we encourage others to enlarge this dataset with instrument runs of further HeLa samples from different machine types.
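As an illustration of how an SDRF annotation can support selecting raw files, the following is a minimal sketch that groups raw files by instrument using only the Python standard library. It assumes the tab-separated SDRF layout with `comment[data file]` and `comment[instrument]` columns; the rows, file names, and the helper `group_files_by_instrument` are hypothetical and are not taken from the published dataset. Instrument values are simplified to plain names here, whereas real SDRF files may encode them as ontology terms.

```python
import csv
import io
from collections import defaultdict

# Hypothetical SDRF excerpt: the file names and instrument values below
# are invented for illustration and do not come from the dataset.
SDRF_TEXT = (
    "source name\tcomment[data file]\tcomment[instrument]\n"
    "HeLa 1\trun_001.raw\tQ Exactive HF\n"
    "HeLa 2\trun_002.raw\tOrbitrap Exploris 480\n"
    "HeLa 3\trun_003.raw\tQ Exactive HF\n"
)


def group_files_by_instrument(sdrf_text):
    """Map each instrument name to the list of raw files annotated with it."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["comment[instrument]"]].append(row["comment[data file]"])
    return dict(groups)


groups = group_files_by_instrument(SDRF_TEXT)
for instrument, files in sorted(groups.items()):
    print(instrument, len(files), files)
```

With a real SDRF file, the in-memory string would be replaced by the file's contents, and the same grouping could be used to pick runs from one machine type for reprocessing or model training.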