Martinez-Mosquera Diana, Navarrete Rosa, Luján-Mora Sergio
Department of Informatics and Computer Science, Escuela Politecnica Nacional, Quito, Ecuador.
Department of Software and Computing Systems, University of Alicante, Alicante, Spain.
PeerJ Comput Sci. 2021 Aug 17;7:e652. doi: 10.7717/peerj-cs.652. eCollection 2021.
eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing many kinds of data. Applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements, or large real-world files. A great number of these files are generated each day, which has driven the development of Big Data tools for parsing and reporting on them, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, such works usually consider only the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. In cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, deserialization and positional explode can be implemented in a straightforward way. To demonstrate the validity of our proposal, we develop a case study in which a test environment illustrates the methods using real data sets from the performance management systems of two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance of Apache Hive with that of Apache Spark. A further contribution is the case study itself, which proposes a novel solution for data analysis in the performance management systems of mobile networks.
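The three techniques named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' Apache Hive or Apache Spark implementation: the sample document, its tag names (`measData`, `measInfo`, `measType`, `measValue`, `r`, `period`), and the helper functions `catalog`, `deserialize`, `posexplode`, and `rows` are all hypothetical and chosen only for illustration. The classification rules (a repeated sibling tag is an array, a single element with children is a structure, a single leaf is a value) are likewise an assumption about how such a catalog might be built.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample, loosely shaped like a performance-management file.
SAMPLE = """
<measData vendor="X">
  <measInfo id="cell">
    <period>900</period>
    <measType p="1">drops</measType>
    <measType p="2">attempts</measType>
    <measValue obj="cell-1">
      <r p="1">3</r>
      <r p="2">120</r>
    </measValue>
  </measInfo>
</measData>
"""


def catalog(root):
    """Cataloging: map each path to root, array, structure, value, or attribute."""
    kinds = {root.tag: "root"}

    def walk(elem, path):
        for name in elem.attrib:
            kinds[f"{path}/@{name}"] = "attribute"
        counts = Counter(child.tag for child in elem)
        for child in elem:
            child_path = f"{path}/{child.tag}"
            if counts[child.tag] > 1:
                kinds[child_path] = "array"          # repeated sibling tag
            elif len(child) > 0:
                kinds.setdefault(child_path, "structure")  # single, has children
            else:
                kinds.setdefault(child_path, "value")      # single leaf
            walk(child, child_path)

    walk(root, root.tag)
    return kinds


def deserialize(elem):
    """Deserialization: dicts for structures, lists for arrays, str for plain values."""
    node = {f"@{k}": v for k, v in elem.attrib.items()}
    counts = Counter(child.tag for child in elem)
    for child in elem:
        value = deserialize(child)
        if counts[child.tag] > 1:
            node.setdefault(child.tag, []).append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:            # no attributes and no children: a plain value
        return text
    if text:
        node["#text"] = text
    return node


def posexplode(items):
    """Positional explode: yield (position, item) pairs, positions starting at 0."""
    yield from enumerate(items)


def rows(doc):
    """Pair each counter name with its value by array position."""
    info = doc["measInfo"]
    names = dict(posexplode(t["#text"] for t in info["measType"]))
    for pos, r in posexplode(info["measValue"]["r"]):
        yield (info["measValue"]["@obj"], names[pos], int(r["#text"]))


root = ET.fromstring(SAMPLE.strip())
for path, kind in catalog(root).items():
    print(f"{kind:10s} {path}")
for row in rows(deserialize(root)):
    print(row)
```

The positional index is what allows two parallel arrays (counter names and counter values) to be matched row by row after flattening. In Hive and Spark SQL the built-in `posexplode` function plays the role of the `posexplode` helper here.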