Zielosko Beata, Jabloński Kamil, Dmytrenko Anton
Institute of Computer Science, University of Silesia in Katowice, Będzińska 39, 41-200 Sosnowiec, Poland.
Entropy (Basel). 2025 Mar 7;27(3):278. doi: 10.3390/e27030278.
Data heterogeneity results from growing data volumes, technological advances, and increasing business requirements in the IT environment. Data comes from different sources, may be dispersed across locations, and may be stored in different structures and formats. Consequently, managing distributed data requires dedicated integration and analysis techniques to ensure coherent processing and a global view. Distributed learning systems often use entropy-based measures to assess the quality of local data and its impact on the global model. One important aspect of data processing is feature selection. This paper proposes a research methodology for constructing multi-level attribute rankings for distributed data. The research was conducted on a publicly available dataset from the UCI Machine Learning Repository. To disperse the data, the table was divided into subtables using reducts, a well-known notion from rough set theory. Local rankings were constructed for the local data sources using an approach based on machine learning models, namely the greedy algorithm for the induction of decision rules. Two types of classifiers relating to explicit and implicit knowledge representation, namely gradient boosting and neural networks, were used to verify the research methodology. Extensive experiments, comparisons, and analysis of the obtained results show the merit of the proposed approach.
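To make the two ideas in the abstract concrete (an entropy-based measure for local data, and the merging of local attribute rankings into a higher-level ranking), the following is a minimal Python sketch. It is not the authors' method: the abstract does not state which entropy measure or aggregation rule the paper uses, so Shannon entropy over class labels and a Borda count are assumptions chosen only for illustration, and the function names `shannon_entropy` and `borda_aggregate` are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the class labels at one local source.

    Assumption: the abstract only mentions "entropy-based measures";
    plain Shannon entropy is used here as an illustrative stand-in.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def borda_aggregate(local_rankings):
    """Merge local attribute rankings (best attribute first) into one ranking.

    Assumption: Borda count is an illustrative aggregation rule, not
    necessarily the one used in the paper. An attribute at position p
    (0-based) in a ranking of length n receives n - p points; attributes
    missing from a local ranking receive no points from that source.
    """
    scores = {}
    for ranking in local_rankings:
        n = len(ranking)
        for position, attribute in enumerate(ranking):
            scores[attribute] = scores.get(attribute, 0) + (n - position)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda a: (-scores[a], a))


if __name__ == "__main__":
    # Hypothetical local rankings from three subtables with different attributes.
    local_rankings = [["a1", "a3", "a2"], ["a3", "a1"], ["a3", "a2", "a4"]]
    print(borda_aggregate(local_rankings))
    print(shannon_entropy(["yes", "yes", "no", "no"]))
```

Because subtables built from different reducts contain different attribute subsets, the sketch lets local rankings differ in length and membership. Any real aggregation scheme would need to decide how to treat attributes that are absent from a source.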