Wagan Asif Ali, Talpur Shahnawaz, Narejo Sanam
Computer Systems Engineering, Mehran University of Engineering & Technology Jamshoro, Jamshoro, Sindh, Pakistan.
PeerJ Comput Sci. 2024 Oct 2;10:e2315. doi: 10.7717/peerj-cs.2315. eCollection 2024.
Datasets characterized by uncertainty arise in many fields, including medical science. Conventional clustering algorithms, designed for deterministic data, often prove inadequate when applied to uncertain data. Recent clustering algorithms based on the possible world model are designed specifically to handle uncertainty and have shown promising results. However, these algorithms have two primary limitations. First, they treat all possible worlds equally, neglecting the relative importance of each world. Second, they rely on time-consuming and inefficient post-processing techniques for world selection. This research aims to cluster the symptoms observed in patients, enabling the exploration of intricate relationships between symptoms. The symptoms dataset presents its own challenges: it is uncertain, and symptoms overlap across multiple diseases, which makes forming mutually exclusive clusters impractical. Conventional similarity measures, which assume mutually exclusive clusters, fail to address these challenges effectively. The categorical nature of the symptoms dataset complicates the analysis further, as most similarity measures are optimized for numerical datasets. To overcome these obstacles, this research proposes a clustering algorithm that considers the precise weight of each symptom in every disease, facilitating the generation of overlapping clusters that accurately depict the associations between symptoms in the context of various diseases.
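The idea of weighting each symptom within each disease to obtain overlapping clusters can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes, purely for illustration, that a symptom's weight in a disease is the fraction of that disease's records listing the symptom, and that a symptom joins every disease cluster where its weight meets a threshold. The function names, the threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict


def symptom_weights(records):
    """Return weights[disease][symptom]: the fraction of that disease's
    records that list the symptom (an assumed weighting, for illustration)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for disease, symptoms in records:
        totals[disease] += 1
        # Count each symptom at most once per record.
        for symptom in set(symptoms):
            counts[disease][symptom] += 1
    return {
        disease: {s: c / totals[disease] for s, c in per_symptom.items()}
        for disease, per_symptom in counts.items()
    }


def overlapping_clusters(weights, threshold=0.5):
    """Build one cluster per disease; a symptom may appear in several
    clusters, so the clusters are not mutually exclusive."""
    return {
        disease: {s for s, w in per_symptom.items() if w >= threshold}
        for disease, per_symptom in weights.items()
    }


# Hypothetical toy records: (disease, observed symptoms).
records = [
    ("flu", ["fever", "cough", "fatigue"]),
    ("flu", ["fever", "cough"]),
    ("malaria", ["fever", "chills"]),
    ("malaria", ["fever", "chills", "fatigue"]),
]

weights = symptom_weights(records)
clusters = overlapping_clusters(weights, threshold=0.75)

# Symptoms that belong to more than one cluster.
shared = clusters["flu"] & clusters["malaria"]
```

In this toy data, "fever" has weight 1.0 in both diseases and therefore lands in both clusters, while "fatigue" has weight 0.5 in each and is excluded at a threshold of 0.75. A hard partitioning method would have to assign "fever" to a single cluster, which is the limitation the abstract describes.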