Comparison of Programmatic Approaches for Efficient Accessing to mzML Files.

Authors

Gilski Miroslaw J, Sadygov Rovshan G

Affiliation

Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, 301 University Blvd., Galveston, TX, 77555, USA.

Publication information

J Data Mining Genomics Proteomics. 2011 Jan 1;2(1). doi: 10.4172/2153-0602.1000109.

Abstract

The Human Proteome Organization (HUPO) Proteomics Standards Initiative has been tasked with developing file formats for proteomics experiments: mzML for storing raw data and mzIdentML for storing the results of spectral processing (protein identification and quantification). Special data types have been designed so that complex experiments can be fully characterized. Standardized file formats will promote visualization, validation and dissemination of data independently of vendor-specific binary data storage files. Innovative programmatic solutions for robust and efficient data access to standardized file formats will contribute to more rapid, wide-scale acceptance of these formats by the proteomics community.

In this work, we compare algorithms for accessing spectral data in the mzML file format. Because mzML is an XML format, its data structures can be parsed efficiently with XML-specific class types. These classes, however, provide only sequential access to files, whereas many algorithmic applications for processing proteomics datasets need random access to spectral data. Here, we demonstrate an implementation of memory streams that converts sequential access into random access while preserving the elegant XML parsing capabilities. Benchmarks of file access times in sequential and random access modes show that random access is more time-efficient when a small number of spectra is retrieved, whereas sequential access becomes more efficient when a large number of spectra is retrieved. We also provide comparisons to other file access methods from academia and industry.
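The memory-stream idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_offset_index`, `read_spectrum`, `iter_spectra`) and the tiny `SAMPLE` document are hypothetical, the sample omits the XML namespace and binary data arrays of a real mzML file, and the byte-offset index is built by a plain scan, which is an assumption made for illustration. The sketch records where each `<spectrum>` element starts and ends, then serves a random-access request by seeking to that offset, copying only those bytes into an in-memory stream, and handing that stream to an ordinary sequential XML parser.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified mzML-like document (no namespace, no binary arrays).
SAMPLE = b"""<mzML><run><spectrumList count="3">
<spectrum index="0" id="scan=1"><cvParam name="ms level" value="1"/></spectrum>
<spectrum index="1" id="scan=2"><cvParam name="ms level" value="2"/></spectrum>
<spectrum index="2" id="scan=3"><cvParam name="ms level" value="2"/></spectrum>
</spectrumList></run></mzML>"""

# "<spectrum" followed by whitespace, so "<spectrumList" is not matched.
SPECTRUM_OPEN = re.compile(rb"<spectrum\s")
SPECTRUM_CLOSE = b"</spectrum>"


def build_offset_index(data):
    """Return (start, end) byte offsets of every <spectrum> element in data."""
    index = []
    for match in SPECTRUM_OPEN.finditer(data):
        start = match.start()
        end = data.index(SPECTRUM_CLOSE, start) + len(SPECTRUM_CLOSE)
        index.append((start, end))
    return index


def read_spectrum(stream, index, i):
    """Random access: seek to spectrum i and parse it from a memory stream."""
    start, end = index[i]
    stream.seek(start)
    # Copy only this element's bytes into an in-memory stream,
    # then let the sequential XML parser work on that small stream.
    chunk = io.BytesIO(stream.read(end - start))
    return ET.parse(chunk).getroot()


def iter_spectra(stream):
    """Sequential access: walk the whole document and yield each spectrum."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        # Strip a namespace prefix such as "{uri}" if one is present.
        if elem.tag.rsplit("}", 1)[-1] == "spectrum":
            yield elem


if __name__ == "__main__":
    index = build_offset_index(SAMPLE)
    spectrum = read_spectrum(io.BytesIO(SAMPLE), index, 2)
    print(spectrum.get("id"))
    print([s.get("id") for s in iter_spectra(io.BytesIO(SAMPLE))])
```

In this sketch a single request costs one seek and one small parse, while `iter_spectra` pays for a pass over the whole file once and amortizes that cost over every spectrum it yields. That is consistent with the trade-off the abstract reports, although the sketch itself makes no timing claims.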
