State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China.
School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 26469, China.
Nucleic Acids Res. 2022 Jan 7;50(D1):D1522-D1527. doi: 10.1093/nar/gkab1081.
The rapid development of proteomics studies has resulted in large volumes of experimental data. The emergence of big data platforms provides the opportunity to handle these large amounts of data. The integrated proteome resource, iProX (https://www.iprox.cn), which was initiated in 2017, has been greatly improved with an up-to-date big data platform implemented in 2021. Here, we describe the main iProX developments since its first publication in Nucleic Acids Research in 2019. First, a hyper-converged architecture with high scalability supports the submission process. A Hadoop cluster can store large amounts of proteomics datasets, and a distributed, RESTful Elasticsearch engine can query millions of records within one second. In addition, several new features, including the Universal Spectrum Identifier (USI) mechanism proposed by ProteomeXchange, a RESTful Web Service API, and a high-efficiency reanalysis pipeline, have been added to iProX for better open data sharing. By the end of August 2021, 1526 datasets had been submitted to iProX, reaching a total data volume of 92.42 TB. With the implementation of the big data platform, iProX can support PB-level data storage, hundreds of billions of spectra records, and second-level latency service capabilities that meet the requirements of the fast-growing field of proteomics.
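To illustrate the USI mechanism mentioned above, the following is a minimal sketch of how a USI string could be split into its colon-separated fields (prefix `mzspec`, dataset collection, MS run name, index type, index value, and an optional interpretation). The `USI` dataclass and the `parse_usi` function are hypothetical names introduced here for illustration and are not part of iProX or any ProteomeXchange library. The sketch also assumes that the run name contains no colons and that only the `scan`, `index`, and `nativeId` index types are used.

```python
from dataclasses import dataclass
from typing import Optional

# Index types accepted by this sketch (an assumption; check the USI spec for the full list).
INDEX_TYPES = {"scan", "index", "nativeId"}


@dataclass(frozen=True)
class USI:
    """Hypothetical container for the fields of a Universal Spectrum Identifier."""
    collection: str                       # dataset identifier, e.g. a ProteomeXchange accession
    ms_run: str                           # MS run (file) name without extension
    index_type: str                       # how the spectrum is addressed within the run
    index: str                            # the spectrum index value
    interpretation: Optional[str] = None  # optional peptidoform/charge, e.g. "PEPTIDE/2"


def parse_usi(usi: str) -> USI:
    """Split a USI into its fields; assumes the run name contains no colons."""
    # Split at most 5 times so colons inside the interpretation
    # (e.g. "[UNIMOD:35]") stay together in the last field.
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a valid USI: {usi!r}")
    _, collection, ms_run, index_type, index = parts[:5]
    if index_type not in INDEX_TYPES:
        raise ValueError(f"unsupported index type: {index_type!r}")
    interpretation = parts[5] if len(parts) == 6 else None
    return USI(collection, ms_run, index_type, index, interpretation)


if __name__ == "__main__":
    example = "mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2"
    print(parse_usi(example))
```

A parsed USI of this kind identifies one spectrum in one run of one public dataset, which is what lets a repository resolve the same identifier to the same spectrum regardless of where the request originates.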