Ghasemi Peyman, Lee Joon
Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.
Department of Biomedical Engineering, University of Calgary, Calgary, AB, Canada.
JMIR Med Inform. 2024 Jul 26;12:e52896. doi: 10.2196/52896.
The application of machine learning in health care often necessitates the use of hierarchical codes such as the International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) systems. These codes classify diseases and medications, respectively, and therefore produce very high-dimensional data. Unsupervised feature selection tackles the "curse of dimensionality" and helps to improve the accuracy and performance of supervised learning models by reducing the number of irrelevant or redundant features and avoiding overfitting. Unsupervised feature selection techniques, including filter, wrapper, and embedded methods, select the features that carry the most intrinsic information. However, they face challenges due to the sheer volume of ICD and ATC codes and the hierarchical structures of these systems.
The objective of this study was to compare several unsupervised feature selection methods for ICD and ATC code databases of patients with coronary artery disease in terms of performance and complexity, and to select the set of features that best represents these patients.
We compared several unsupervised feature selection methods for 2 ICD and 1 ATC code databases of 51,506 patients with coronary artery disease in Alberta, Canada. Specifically, we used the Laplacian score, unsupervised feature selection for multicluster data, autoencoder-inspired unsupervised feature selection, principal feature analysis, and concrete autoencoders with and without ICD or ATC tree weight adjustment to select the 100 best features from over 9000 ICD and 2000 ATC codes. We assessed the selected features based on their ability to reconstruct the initial feature space and predict 90-day mortality following discharge. We also compared the complexity of the selected features by mean code level in the ICD or ATC tree and the interpretability of the features in the mortality prediction task using Shapley analysis.
In feature space reconstruction and mortality prediction, the concrete autoencoder-based methods outperformed other techniques. In particular, a weight-adjusted concrete autoencoder variant demonstrated improved reconstruction accuracy and a statistically significant improvement in predictive performance, as confirmed by DeLong and McNemar tests (P<.05). Concrete autoencoders preferred more general codes, and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted concrete autoencoders yielded higher Shapley values in mortality prediction than most alternatives.
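As an illustration of how a McNemar test compares two classifiers on the same patients, the sketch below counts the cases where exactly one model is correct and computes a two-sided exact binomial P value. The abstract does not state which variant of the test was used, so the exact form is an assumption, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p) where b counts cases only model A got right and
    c counts cases only model B got right.
    """
    b = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a == y and m != y)
    c = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a != y and m == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```

Only the discordant pairs enter the statistic, which is why the test suits comparisons of two feature sets evaluated on an identical test cohort.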
This study scrutinized 5 feature selection methods in ICD and ATC code data sets in an unsupervised context. Our findings underscore the superiority of the concrete autoencoder method in selecting salient features that represent the entire data set, offering a potential asset for subsequent machine learning research. We also present a novel weight adjustment approach for the concrete autoencoders specifically tailored for ICD and ATC code data sets to enhance the generalizability and interpretability of the selected features.
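For readers unfamiliar with concrete autoencoders, the selection layer can be sketched as follows. This is a minimal NumPy illustration of the standard Gumbel-softmax (concrete) relaxation only; it omits the decoder, training loop, and the ICD or ATC tree weight adjustment proposed in the study, whose details are not given in this abstract. The names `concrete_select` and `hard_select` are hypothetical.

```python
import numpy as np

def concrete_select(x, logits, temperature, rng):
    """Soft feature selection with the concrete relaxation.

    x:      (n_samples, n_features) input matrix
    logits: (n_selected, n_features) learnable selection logits
    Returns the (n_samples, n_selected) soft-selected features and the
    (n_selected, n_features) selection weights.
    """
    # Gumbel(0, 1) noise; the lower bound keeps log() finite.
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))

    # Temperature-scaled softmax over features, one row per selector.
    z = (logits + gumbel) / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    m = np.exp(z)
    m = m / m.sum(axis=1, keepdims=True)
    return x @ m.T, m

def hard_select(x, logits):
    """After training, each selector keeps its highest-logit feature."""
    idx = logits.argmax(axis=1)
    return x[:, idx], idx
```

During training the temperature is typically annealed toward zero, so each row of the selection weights approaches a one-hot vector and the layer converges to picking individual input features.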