Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore.
School of Biological Sciences, Nanyang Technological University, Singapore, 637551, Singapore.
Sci Data. 2023 Dec 2;10(1):858. doi: 10.1038/s41597-023-02779-8.
Mass spectrometry-based proteomics plays a critical role in current biological and clinical research. Technical issues such as data integration, missing value imputation, and batch effect correction, together with the inter-connections amongst them, can produce errors but are not well studied. Although proteomic technologies have improved significantly in recent years, this alone cannot resolve these issues. What is needed are better algorithms and data processing knowledge, and obtaining these requires appropriate proteomics datasets for exploration, investigation, and benchmarking. To meet this need, we developed MultiPro (Multi-purpose Proteome Resource), a resource comprising four comprehensive large-scale proteomics datasets with deliberate batch effects, acquired using the latest parallel accumulation-serial fragmentation (PASEF) in both Data-Dependent Acquisition (DDA) and Data-Independent Acquisition (DIA) modes. Each dataset has a balanced two-class design based on well-characterized and widely studied cell lines (A549 vs K562 or HCC1806 vs HS578T), with 48 or 36 biological and technical replicates altogether. These datasets allow for investigation of a multitude of technical issues, including the inter-connections between class and batch factors, and support the development of approaches to compare and integrate data from DDA and DIA platforms.