StackZDPD: a novel encoding scheme for mass spectrometry data optimized for speed and compression ratio.

Affiliations

Zhejiang University, Hangzhou, 310058, China.

School of Life Science, Westlake University, Hangzhou, 310023, China.

Publication information

Sci Rep. 2022 Mar 30;12(1):5384. doi: 10.1038/s41598-022-09432-1.

DOI:10.1038/s41598-022-09432-1

PMID:35354909

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8967824/

Abstract

As the pervasive, standardized format for the interchange and deposition of raw mass spectrometry (MS) proteomics and metabolomics data, text-based mzML is used inefficiently on various analysis platforms because of its sheer volume of samples and limited read/write speed. Research on compression algorithms rarely provides a flexible scheme for random file reading. Database-developed solutions guarantee efficient random file reading, but efforts in compression and third-party software support remain insufficient. While preserving decompression efficiency, we propose an encoding scheme, "Stack-ZDPD", that is optimized for the storage of raw MS data. It is designed for "Aird", a computation-oriented format with fast access and decoding, whose core compression algorithm is "ZDPD". Stack-ZDPD reduces the volume of data stored in mzML format by around 80% or more, depending on the data acquisition pattern, and the compression ratio is approximately 30% compared to ZDPD for data generated using Time of Flight technology. Our approach is available in AirdPro, for file conversion, and in the Java API Aird-SDK, for data parsing.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/35c4/8967824/eb1cc1a263b2/41598_2022_9432_Fig1_HTML.jpg

Similar articles

Aird: a computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time.

BMC Bioinformatics. 2022 Jan 12;23(1):35. doi: 10.1186/s12859-021-04490-0.

mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements.

J Proteome Res. 2021 Jan 1;20(1):172-183. doi: 10.1021/acs.jproteome.0c00192. Epub 2020 Oct 29.

pymzML v2.0: introducing a highly compressed and seekable gzip format.

Bioinformatics. 2018 Jul 15;34(14):2513-2514. doi: 10.1093/bioinformatics/bty046.

Numerical compression schemes for proteomics mass spectrometry data.

Mol Cell Proteomics. 2014 Jun;13(6):1537-42. doi: 10.1074/mcp.O114.037879. Epub 2014 Mar 27.

MRMPro: a web-based tool to improve the speed of manual calibration for multiple reaction monitoring data analysis by mass spectrometry.

BMC Bioinformatics. 2024 Feb 6;25(1):60. doi: 10.1186/s12859-024-05685-x.

Mass spectrometer output file format mzML.

Methods Mol Biol. 2010;604:319-31. doi: 10.1007/978-1-60761-444-9_22.

ms-data-core-api: an open-source, metadata-oriented library for computational proteomics.

Bioinformatics. 2015 Sep 1;31(17):2903-5. doi: 10.1093/bioinformatics/btv250. Epub 2015 Apr 24.

SCALCE: boosting sequence compression algorithms using locally consistent encoding.

Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.

smallWig: parallel compression of RNA-seq WIG files.

Bioinformatics. 2016 Jan 15;32(2):173-80. doi: 10.1093/bioinformatics/btv561. Epub 2015 Sep 30.

Cited by

Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work.

J Proteome Res. 2023 Feb 3;22(2):287-301. doi: 10.1021/acs.jproteome.2c00637. Epub 2023 Jan 10.

MZA: A Data Conversion Tool to Facilitate Software Development and Artificial Intelligence Research in Multidimensional Mass Spectrometry.

J Proteome Res. 2023 Feb 3;22(2):508-513. doi: 10.1021/acs.jproteome.2c00313. Epub 2022 Nov 22.
