Zhejiang University, Hangzhou, 310058, China.
School of Life Science, Westlake University, Hangzhou, 310023, China.
Sci Rep. 2022 Mar 30;12(1):5384. doi: 10.1038/s41598-022-09432-1.
As the pervasive, standardized format for the interchange and deposition of raw mass spectrometry (MS) proteomics and metabolomics data, text-based mzML is used inefficiently on many analysis platforms because of the sheer volume of sample data and its limited read/write speed. Most research on compression algorithms does not provide a flexible scheme for random file reading. Database-based solutions guarantee efficient random file reading, but their compression performance and third-party software support remain insufficient. While preserving decompression efficiency, we propose "Stack-ZDPD", an encoding scheme optimized for the storage of raw MS data. It is designed for "Aird", a computation-oriented format with fast access and decoding times whose core compression algorithm is "ZDPD". Stack-ZDPD reduces the volume of data stored in mzML format by around 80% or more, depending on the data acquisition pattern, and the compression ratio is approximately 30% compared to ZDPD for data generated using Time of Flight technology. Our approach is available in AirdPro, for file conversion, and in the Java API Aird-SDK, for data parsing.