Qurtuba University of Science & IT, Peshawar, Pakistan.
Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia.
PLoS One. 2022 May 13;17(5):e0265190. doi: 10.1371/journal.pone.0265190. eCollection 2022.
Many real-world applications, such as those in business and healthcare, generate large categorical datasets with uncertainty. A fundamental task is to efficiently discover hidden, non-trivial patterns in such large uncertain categorical datasets. Because the exact value of an attribute is often unknown in uncertain categorical datasets, conventional clustering algorithms are not well suited to handling categorical data, uncertainty, and stability.
Rough Set Theory supports decision making in the presence of vagueness and uncertainty in data. Although recent categorical clustering techniques based on Rough Set Theory help, they suffer from low accuracy, high computational complexity, and poor generalizability, especially on datasets where they sometimes fail, or struggle, to select their best clustering attribute.
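To make the Rough Set Theory vocabulary concrete, the following is a minimal sketch of the standard lower and upper approximations of a set of objects, together with the usual accuracy-of-approximation ratio. It is not the RPA algorithm, which the abstract does not specify; the toy table, the attribute names, and the function names are all hypothetical.

```python
from collections import defaultdict

def equivalence_classes(rows, attrs):
    """Group row indices that agree on every attribute in `attrs`
    (the indiscernibility relation of Rough Set Theory)."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the index set `target`."""
    lower, upper = set(), set()
    for cls in equivalence_classes(rows, attrs):
        if cls <= target:   # class lies entirely inside the target
            lower |= cls
        if cls & target:    # class overlaps the target
            upper |= cls
    return lower, upper

def rough_accuracy(rows, attrs, target):
    """Accuracy of approximation: |lower| / |upper| (1.0 for an empty upper set)."""
    lower, upper = approximations(rows, attrs, target)
    return len(lower) / len(upper) if upper else 1.0

# Hypothetical toy information system (not from the paper).
rows = [
    {"price": "low", "quality": "good"},
    {"price": "low", "quality": "good"},
    {"price": "high", "quality": "good"},
    {"price": "high", "quality": "poor"},
]
target = {0, 2}

lower, upper = approximations(rows, ["price", "quality"], target)
accuracy_price_only = rough_accuracy(rows, ["price"], target)
```

In this toy table, objects 0 and 1 are indiscernible, so object 0 can only be placed in the upper approximation of the target; this is the kind of uncertainty that rough-set-based clustering has to work with.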
The main objective of this research is to propose a new information-theoretic Rough Purity Approach (RPA). A further objective is to address the problems of traditional Rough Set Theory based categorical clustering techniques. The ultimate goal is therefore to cluster uncertain categorical datasets efficiently in terms of performance, generalizability, and computational complexity.
RPA takes into account the information-theoretic attribute purity of categorical-valued information systems. Extensive experiments are conducted to evaluate the efficiency of RPA using a real Supplier Base Management (SBM) dataset and six benchmark UCI datasets. The proposed RPA is also compared with several recent categorical data clustering techniques.
The experimental results show that RPA outperforms the baseline algorithms. The significant percentage improvements in time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), and accuracy (12.15%), as well as in the Rough Accuracy of clusters, show that RPA is suitable for practical use.
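The abstract does not state how purity and entropy are computed. As a rough illustration, the sketch below uses the definitions commonly applied in clustering evaluation: purity as the fraction of objects that belong to the majority class of their cluster, and entropy as the size-weighted class entropy within clusters (lower is better). The function names and the toy labels are assumptions, and the paper's exact formulas may differ.

```python
import math
from collections import Counter

def cluster_purity(clusters, labels):
    """Fraction of objects that belong to the majority class of their cluster.
    Assumes the clusters partition all indices of `labels`."""
    total = len(labels)
    majority = sum(
        Counter(labels[i] for i in members).most_common(1)[0][1]
        for members in clusters
        if members
    )
    return majority / total

def cluster_entropy(clusters, labels):
    """Size-weighted average of the class entropy (base 2) inside each cluster."""
    total = len(labels)
    result = 0.0
    for members in clusters:
        if not members:
            continue
        counts = Counter(labels[i] for i in members)
        h = -sum(
            (n / len(members)) * math.log2(n / len(members))
            for n in counts.values()
        )
        result += len(members) / total * h
    return result

# Hypothetical ground-truth labels and a clustering of the six objects.
labels = ["a", "a", "a", "b", "b", "b"]
clusters = [[0, 1, 2, 3], [4, 5]]

purity = cluster_purity(clusters, labels)
entropy = cluster_entropy(clusters, labels)
```

Here the first cluster mixes three "a" objects with one "b" object, so purity falls below 1 and entropy rises above 0; a clustering that separates the two classes perfectly would score a purity of 1 and an entropy of 0.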
We conclude that, compared with other techniques, the attribute purity of categorical-valued information systems clusters the data better. Hence, the RPA technique can be recommended for large-scale clustering in multiple domains, and its performance can be enhanced further in future research.