Department of Electrical and Computer Engineering, Razi University, Kermanshah, Iran.
J Biomed Inform. 2021 Apr;116:103695. doi: 10.1016/j.jbi.2021.103695. Epub 2021 Feb 4.
Existing data mining approaches for identifying disease risk factors have several shortcomings. They typically discretize numerical features with crisp partitions and do not use patient-specific profiles, which limits their usefulness on real problems. Crisp discretization can introduce substantial partitioning errors, particularly for values that lie close to a partition boundary. Because the normal range of each numerical feature varies with a patient's age, gender, and medical conditions, ignoring these differences can undermine the accuracy of the extracted itemsets and rules. This paper presents a profile-based fuzzy association rule mining (PB-FARM) approach for assessing risk factors highly correlated with diseases. The approach has three phases. In Phase I, a profile is created for each patient based on age, gender, and medical conditions in order to determine the normal range of each numerical feature; fuzzy partitioning is then applied to all features, numerical and categorical, and a structure called FirstScan is built. In Phase II, the FirstScan structure is used to mine large fuzzy k-itemsets. In Phase III, these k-itemsets are used to generate fuzzy rules describing associations between risk factors and diseases. The method was evaluated on the Z-Alizadeh Sani coronary artery disease (CAD) dataset, which contains 303 records and 54 features. The results show that typical chest pain and old age are positively correlated with the incidence of CAD. The comparisons made in this study showed that the proposed algorithm achieves higher partitioning accuracy than other methods and has a reasonably short execution time.
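The contrast between crisp and profile-based fuzzy partitioning, and the way fuzzy itemsets and rules can be scored, may be easier to see in a small sketch. The abstract does not specify the membership functions, the profile table, or the internals of FirstScan, so everything below is an assumption made for illustration: the linear-ramp memberships, the `width` parameter, the `PROFILE_RANGES` table and its placeholder numbers (not clinical reference values), the item names, and the use of `min` to combine memberships are all hypothetical choices rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalRange:
    low: float
    high: float


# Hypothetical profile table: (gender, age band) -> normal range of one feature.
# Placeholder numbers for illustration only, not clinical reference values.
PROFILE_RANGES = {
    ("F", "under_60"): NormalRange(12.0, 15.0),
    ("M", "under_60"): NormalRange(13.0, 17.0),
}


def crisp_label(x: float, rng: NormalRange) -> str:
    """Crisp partitioning: a value belongs to exactly one partition."""
    if x < rng.low:
        return "low"
    if x > rng.high:
        return "high"
    return "normal"


def _ramp_up(x: float, a: float, b: float) -> float:
    """Linear ramp from 0 at a to 1 at b, clamped to [0, 1]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))


def fuzzy_memberships(x: float, rng: NormalRange, width: float = 1.0) -> dict:
    """Fuzzy partitioning (assumed shape): membership changes linearly over
    a band of `width` centred on each boundary. Requires
    rng.high - rng.low >= width so the two ramps do not overlap."""
    half = width / 2.0
    low = 1.0 - _ramp_up(x, rng.low - half, rng.low + half)
    high = _ramp_up(x, rng.high - half, rng.high + half)
    return {"low": low, "normal": 1.0 - low - high, "high": high}


def fuzzy_support(records: list, itemset: set) -> float:
    """Mean over records of the minimum membership among the itemset's items
    (min is one common way to combine memberships, assumed here)."""
    total = sum(min(r.get(item, 0.0) for item in itemset) for r in records)
    return total / len(records)


def fuzzy_confidence(records: list, antecedent: set, consequent: set) -> float:
    """Support of antecedent and consequent together over support of antecedent."""
    return fuzzy_support(records, antecedent | consequent) / fuzzy_support(
        records, antecedent
    )


if __name__ == "__main__":
    x = 13.0  # the same measurement, interpreted under two profiles
    for profile, rng in PROFILE_RANGES.items():
        print(profile, crisp_label(x, rng), fuzzy_memberships(x, rng))

    # Toy membership records with made-up values, not taken from the dataset.
    toy = [
        {"age=old": 0.8, "chest_pain=typical": 1.0, "cad=yes": 1.0},
        {"age=old": 0.2, "chest_pain=typical": 0.0, "cad=yes": 0.0},
    ]
    print(fuzzy_support(toy, {"age=old", "cad=yes"}))
    print(fuzzy_confidence(toy, {"age=old"}, {"cad=yes"}))
```

Under these assumptions, a value sitting exactly on a profile's lower boundary is labelled fully "normal" by the crisp rule but receives split membership between "low" and "normal" in the fuzzy version, which is the boundary effect the abstract describes. The same value falls well inside the normal range of the other profile, which is why the range is looked up per profile.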