Rough set based information theoretic approach for clustering uncertain categorical data.

Qurtuba University of Science & IT, Peshawar, Pakistan.

Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia.

PLoS One. 2022 May 13;17(5):e0265190. doi: 10.1371/journal.pone.0265190. eCollection 2022.

MOTIVATION

Many real-world applications, for example in business and health, generate large categorical datasets with uncertainty. A fundamental task is to efficiently discover hidden and non-trivial patterns in such large uncertain categorical datasets. Because the exact value of an attribute is often unknown in uncertain categorical datasets, conventional clustering algorithms do not provide a suitable means for dealing with categorical data, uncertainty, and stability.

PROBLEM STATEMENT

Decision making in the presence of vagueness and uncertainty in data can be handled using Rough Set Theory. Although recent categorical clustering techniques based on Rough Set Theory help, they suffer from low accuracy, high computational complexity, and poor generalizability, especially on datasets where they sometimes fail to select, or can hardly select, their best clustering attribute.
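
As background for the Rough Set Theory terms used here, the following minimal sketch computes the standard lower and upper approximations of a target set and the resulting accuracy of approximation (lower size divided by upper size). The toy table, the attribute names `size`, `color`, and `label`, and the function names are hypothetical illustrations, not taken from the paper.

```python
from collections import defaultdict


def equivalence_classes(rows, attrs):
    """Group object indices that are indiscernible on the attributes `attrs`."""
    groups = defaultdict(set)
    for idx, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        groups[key].add(idx)
    return list(groups.values())


def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the index set `target`."""
    lower, upper = set(), set()
    for block in equivalence_classes(rows, attrs):
        if block <= target:   # block lies entirely inside the target
            lower |= block
        if block & target:    # block overlaps the target
            upper |= block
    return lower, upper


def rough_accuracy(lower, upper):
    """Accuracy of approximation: |lower| / |upper| (1.0 for an empty upper set)."""
    return len(lower) / len(upper) if upper else 1.0


# Hypothetical toy information system (assumption, for illustration only).
rows = [
    {"size": "small", "color": "red", "label": "yes"},
    {"size": "small", "color": "red", "label": "no"},
    {"size": "large", "color": "red", "label": "yes"},
    {"size": "large", "color": "blue", "label": "yes"},
    {"size": "small", "color": "blue", "label": "no"},
]
target = {i for i, r in enumerate(rows) if r["label"] == "yes"}
lower, upper = approximations(rows, ["size", "color"], target)
print(sorted(lower), sorted(upper), rough_accuracy(lower, upper))
```

Objects 0 and 1 share the same attribute values but differ in label, so they fall in the boundary region: they belong to the upper approximation but not to the lower one, which is what lowers the accuracy below 1.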

OBJECTIVES

The main objective of this research is to propose a new information-theoretic Rough Purity Approach (RPA). A second objective is to address the problems of traditional categorical clustering techniques based on Rough Set Theory. The ultimate goal is to cluster uncertain categorical datasets efficiently in terms of performance, generalizability, and computational complexity.

METHODS

The RPA takes into consideration the information-theoretic attribute purity of categorical-valued information systems. Several extensive experiments are conducted to evaluate the efficiency of RPA using a real Supplier Base Management (SBM) dataset and six benchmark UCI datasets. The proposed RPA is also compared with several recent categorical data clustering techniques.
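
The abstract does not give the RPA purity formula, so the sketch below is not the authors' method. It only illustrates one generic information-theoretic way to score a candidate clustering attribute: the mean conditional entropy of the other attributes given the candidate, where a lower value means the partition induced by the candidate is purer. The function names, the attribute names `a`, `b`, `c`, and the toy rows are all assumptions.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(rows, given, target):
    """Shannon conditional entropy H(target | given) in bits."""
    n = len(rows)
    groups = defaultdict(list)
    for row in rows:
        groups[row[given]].append(row[target])
    total = 0.0
    for values in groups.values():
        size = len(values)
        block_h = -sum(
            (c / size) * math.log2(c / size) for c in Counter(values).values()
        )
        total += (size / n) * block_h  # weight each block by its share of objects
    return total


def mean_impurity(rows, candidate, attrs):
    """Average H(other | candidate) over all other attributes (generic score)."""
    others = [a for a in attrs if a != candidate]
    return sum(conditional_entropy(rows, candidate, a) for a in others) / len(others)


# Hypothetical categorical table (assumption, for illustration only).
rows = [
    {"a": "x", "b": "p", "c": "u"},
    {"a": "x", "b": "p", "c": "u"},
    {"a": "y", "b": "q", "c": "u"},
    {"a": "y", "b": "p", "c": "v"},
]
attrs = ["a", "b", "c"]
scores = {a: mean_impurity(rows, a, attrs) for a in attrs}
best = min(scores, key=scores.get)  # attribute whose partition is purest
print(scores, best)
```

On this toy table, splitting on `a` leaves the least residual uncertainty in the other attributes, so it would be chosen under this generic criterion.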

RESULTS

The experimental results show that RPA outperforms the baseline algorithms. The significant percentage improvements with respect to time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), and accuracy (12.15%), as well as the Rough Accuracy of the clusters, show that RPA is suitable for practical use.
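
For readers unfamiliar with the evaluation measures named above, here is a minimal sketch of the standard definitions of cluster purity and cluster entropy, plus a relative percentage improvement. The abstract does not say how its percentages were computed, so the formula relative to the baseline is an assumption, as are the function names and the toy labels.

```python
import math
from collections import Counter, defaultdict


def _members_by_cluster(clusters, classes):
    """Collect the true class labels of the objects in each cluster."""
    groups = defaultdict(list)
    for k, c in zip(clusters, classes):
        groups[k].append(c)
    return groups


def purity(clusters, classes):
    """Fraction of objects that belong to the majority class of their cluster."""
    groups = _members_by_cluster(clusters, classes)
    majority = sum(max(Counter(m).values()) for m in groups.values())
    return majority / len(classes)


def clustering_entropy(clusters, classes):
    """Size-weighted Shannon entropy (bits) of class labels within clusters."""
    groups = _members_by_cluster(clusters, classes)
    n = len(classes)
    total = 0.0
    for members in groups.values():
        size = len(members)
        h = -sum((c / size) * math.log2(c / size) for c in Counter(members).values())
        total += (size / n) * h
    return total


def percent_improvement(baseline, proposed, lower_is_better=False):
    """Relative change versus the baseline, in percent (assumed definition)."""
    gain = (baseline - proposed) if lower_is_better else (proposed - baseline)
    return 100.0 * gain / baseline


# Hypothetical cluster assignments and true classes (assumption).
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes), clustering_entropy(clusters, classes))
print(percent_improvement(10.0, 4.0, lower_is_better=True))
```

Higher purity and lower entropy both indicate clusters that align better with the known classes; for time, iterations, and entropy, an improvement means a reduction, which is why the helper takes a `lower_is_better` flag.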

CONCLUSION

We conclude that, compared with other techniques, the attribute purity of categorical-valued information systems can cluster the data better. Hence, the RPA technique can be recommended for large-scale clustering in multiple domains, and its performance can be enhanced in further research.
