Gilski Miroslaw J, Sadygov Rovshan G
Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, 301 University Blvd., Galveston, TX, 77555, USA.
J Data Mining Genomics Proteomics. 2011 Jan 1;2(1). doi: 10.4172/2153-0602.1000109.
The Human Proteome Organization (HUPO) Proteomics Standards Initiative has been tasked with developing file formats for proteomics experiments: mzML for storing raw data, and mzIdentML for storing the results of spectral processing (protein identification and quantification). Special data types have been designed to fully characterize complex experiments. Standardized file formats will promote visualization, validation and dissemination of data independently of vendor-specific binary data storage files. Innovative programmatic solutions for robust and efficient data access to standardized file formats will contribute to more rapid, wide-scale acceptance of these formats by the proteomics community. In this work, we compare algorithms for accessing spectral data in the mzML file format. Because mzML is an XML format, its data structures can be parsed efficiently with XML-specific class types. These classes, however, provide only sequential access to files, whereas many algorithmic applications for processing proteomics datasets need random access to spectral data. Here, we demonstrate an implementation of memory streams that converts sequential access into random access while preserving the elegant XML parsing capabilities. Benchmarking of file access times in sequential and random access modes shows that random access is more time efficient when retrieving a small number of spectra, while sequential access becomes more efficient when retrieving a large number of spectra. We also provide comparisons to other file access methods from academia and industry.
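The memory-stream idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation; it is one plausible reading of the technique under stated assumptions. A single sequential pass records the byte offset and length of each `<spectrum>` element, and a later request seeks to that offset, copies the bytes into an in-memory stream (`io.BytesIO`), and hands the stream to an ordinary XML parser. The function names `build_offset_index` and `read_spectrum` are hypothetical, and the sketch assumes that at most one `<spectrum>` element opens per line and that each spectrum carries an integer `index` attribute.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches an opening <spectrum ...> tag and captures its integer index attribute.
# The \s after "spectrum" keeps <spectrumList ...> from matching.
SPECTRUM_OPEN = re.compile(rb'<spectrum\s[^>]*\bindex="(\d+)"')
SPECTRUM_CLOSE = b"</spectrum>"


def build_offset_index(f):
    """One sequential pass over a binary, seekable file object.

    Returns {spectrum index: (byte offset, byte length)}.
    Assumption: at most one <spectrum> element opens per line.
    """
    index = {}
    start = None
    current = None
    f.seek(0)
    while True:
        pos = f.tell()          # byte offset of the line about to be read
        line = f.readline()
        if not line:
            break
        if start is None:
            match = SPECTRUM_OPEN.search(line)
            if match:
                start = pos + match.start()
                current = int(match.group(1))
        if start is not None:
            close = line.find(SPECTRUM_CLOSE)
            if close != -1:
                end = pos + close + len(SPECTRUM_CLOSE)
                index[current] = (start, end - start)
                start = None
    return index


def read_spectrum(f, index, spectrum_index):
    """Random access: seek, copy one element into a memory stream, parse it."""
    offset, length = index[spectrum_index]
    f.seek(offset)
    stream = io.BytesIO(f.read(length))   # the in-memory stream
    return ET.parse(stream).getroot()     # standard XML parsing is preserved
```

In use, the file would be opened in binary mode, indexed once with `build_offset_index`, and then queried repeatedly with `read_spectrum(f, index, n)`. The indexing pass costs one full read of the file, so the approach pays off only when spectra are then fetched selectively. This is consistent with the trade-off reported in the abstract, where sequential access becomes more efficient once a large number of spectra is retrieved.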