Efficient processing of complex XSD using Hive and Spark.

Author information

Martinez-Mosquera Diana, Navarrete Rosa, Luján-Mora Sergio

Affiliations

Department of Informatics and Computer Science, Escuela Politecnica Nacional, Quito, Ecuador.

Department of Software and Computing Systems, University of Alicante, Alicante, Spain.

Publication information

PeerJ Comput Sci. 2021 Aug 17;7:e652. doi: 10.7717/peerj-cs.652. eCollection 2021.

DOI:10.7717/peerj-cs.652

PMID:34497870

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8384044/

Abstract

eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing many kinds of data. Applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements, or produce large real-world files. A great number of these files are generated each day, which has driven the use of Big Data tools such as Apache Hive and Apache Spark for parsing and reporting. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, such works usually involve only the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. For cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, deserialization and positional explode are straightforward to implement. To demonstrate the validity of our proposal, we develop a case study in which a test environment illustrates the methods on real data sets from the performance management systems of two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and for Apache Spark data frames, and compare the query performance of Apache Hive with that of Apache Spark. A further contribution is the case study itself, which proposes a novel solution for data analysis in the performance management systems of mobile networks.
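The three techniques named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' Hive or Spark implementation: the sample XML, the element names (`measData`, `measInfo`, `measType`, `measValue`), the classification heuristic in `catalog`, and the helper names are all assumptions made for illustration. The `posexplode` helper only mimics the behavior of the `posexplode` function available in Hive and Spark SQL, which emits each array element together with its position.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely shaped like a performance-management file.
SAMPLE_XML = """
<measData vendor="A">
  <fileHeader><sender>node-7</sender></fileHeader>
  <measInfo id="cell-1">
    <measType>dropRate</measType>
    <measType>throughput</measType>
    <measValue>0.02</measValue>
    <measValue>48.5</measValue>
  </measInfo>
  <measInfo id="cell-2">
    <measType>dropRate</measType>
    <measType>throughput</measType>
    <measValue>0.05</measValue>
    <measValue>37.1</measValue>
  </measInfo>
</measData>
"""


def catalog(root):
    """Cataloging: label each path as root, array, structure, value, or attribute.

    Assumed heuristic: a tag repeated among siblings is an array, an element
    with children is a structure, and any other element is a value.
    """
    kinds = {root.tag: "root"}

    def walk(elem, path):
        for name in elem.attrib:
            kinds[f"{path}/@{name}"] = "attribute"
        counts = Counter(child.tag for child in elem)
        for child in elem:
            child_path = f"{path}/{child.tag}"
            if counts[child.tag] > 1:
                kinds[child_path] = "array"
            elif len(child) > 0:
                kinds[child_path] = "structure"
            else:
                kinds[child_path] = "value"
            walk(child, child_path)

    walk(root, root.tag)
    return kinds


def deserialize(root):
    """Deserialization: turn each measInfo element into a nested record."""
    return [
        {
            "id": info.get("id"),
            "types": [t.text for t in info.findall("measType")],
            "values": [float(v.text) for v in info.findall("measValue")],
        }
        for info in root.findall("measInfo")
    ]


def posexplode(items):
    """Positional explode: pair every array element with its position."""
    return list(enumerate(items))


def flatten(records):
    """Use the position to align the two parallel arrays into flat rows."""
    rows = []
    for rec in records:
        for pos, mtype in posexplode(rec["types"]):
            rows.append((rec["id"], pos, mtype, rec["values"][pos]))
    return rows


doc = ET.fromstring(SAMPLE_XML.strip())
for path, kind in catalog(doc).items():
    print(f"{kind:10s} {path}")
for row in flatten(deserialize(doc)):
    print(row)
```

The position returned by the explode step is what keeps each measurement type paired with its value once the nested arrays are flattened into rows; in this sketch that pairing is the only reason `flatten` needs `posexplode` rather than a plain loop over one array.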

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cff7/8384044/4997e50e02cf/peerj-cs-07-652-g001.jpg

Similar articles

Efficient processing of complex XSD using Hive and Spark.

PeerJ Comput Sci. 2021 Aug 17;7:e652. doi: 10.7717/peerj-cs.652. eCollection 2021.

Accessing complex patient data from Arden Syntax Medical Logic Modules.

Artif Intell Med. 2018 Nov;92:95-102. doi: 10.1016/j.artmed.2015.09.003. Epub 2015 Sep 12.

Enabling Massive XML-Based Biological Data Management in HBase.

IEEE/ACM Trans Comput Biol Bioinform. 2020 Nov-Dec;17(6):1994-2004. doi: 10.1109/TCBB.2019.2915811. Epub 2020 Dec 8.

Human Behavior Analysis Using Intelligent Big Data Analytics.

Front Psychol. 2021 Jul 6;12:686610. doi: 10.3389/fpsyg.2021.686610. eCollection 2021.

XML Schema Representation of DICOM Structured Reporting.

J Am Med Inform Assoc. 2003 Mar-Apr;10(2):213-23. doi: 10.1197/jamia.m1042.

Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

Gigascience. 2018 Jun 1;7(6). doi: 10.1093/gigascience/giy052.

Big Data in metagenomics: Apache Spark vs MPI.

PLoS One. 2020 Oct 6;15(10):e0239741. doi: 10.1371/journal.pone.0239741. eCollection 2020.

Web-based infectious disease reporting using XML forms.

Int J Med Inform. 2008 Sep;77(9):630-40. doi: 10.1016/j.ijmedinf.2007.10.011. Epub 2007 Dec 3.

Big data clustering techniques based on Spark: a literature review.

PeerJ Comput Sci. 2020 Nov 30;6:e321. doi: 10.7717/peerj-cs.321. eCollection 2020.

Big Data Approaches for the Analysis of Large-Scale fMRI Data Using Apache Spark and GPU Processing: A Demonstration on Resting-State fMRI Data from the Human Connectome Project.

Front Neurosci. 2016 Jan 6;9:492. doi: 10.3389/fnins.2015.00492. eCollection 2015.
