Askar Mohsen, Småbrekke Lars, Holsbø Einar, Bongo Lars Ailo, Svendsen Kristian
Department of Pharmacy, Faculty of Health Sciences, UiT-The Arctic University of Norway, PO Box 6050, Stakkevollan, N-9037 Tromsø, Norway.
Department of Computer Science, Faculty of Science and Technology, UiT-The Arctic University of Norway, PO Box 6050, Stakkevollan, N-9037 Tromsø, Norway.
Explor Res Clin Soc Pharm. 2024 Jun 11;14:100463. doi: 10.1016/j.rcsop.2024.100463. eCollection 2024 Jun.
Machine learning (ML) prediction models in healthcare and pharmacy-related research face challenges with encoding high-dimensional Healthcare Coding Systems (HCSs) such as ICD, ATC, and DRG codes, given the trade-off between reducing model dimensionality and minimizing information loss.
To investigate using Network Analysis modularity as a method to group HCSs to improve encoding in ML models.
The MIMIC-III dataset was used to create a multimorbidity network in which ICD-9 codes are the nodes and each edge is weighted by the number of patients sharing that pair of ICD-9 codes. A modularity detection algorithm was applied at different resolution thresholds to generate 6 sets of modules. The impact of four grouping strategies on the performance of predicting 90-day Intensive Care Unit readmissions was assessed. The strategies compared were: 1) binary encoding of codes, 2) encoding codes grouped by network modules, 3) grouping codes to the highest level of the ICD-9 hierarchy, and 4) grouping using the single-level Clinical Classification Software (CCS). The same methodology was also applied to encode DRG codes, but the comparison was limited to a single modularity threshold versus binary encoding. Performance was assessed using Logistic Regression, Support Vector Machine with a non-linear kernel, and Gradient Boosting Machine algorithms. Accuracy, Precision, Recall, AUC, and F1-score were reported with 95% confidence intervals.
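The network construction and module-based encoding described above can be illustrated with a minimal Python sketch. It is not the authors' package or algorithm: the function names (`build_comorbidity_edges`, `greedy_modules`, `encode_by_module`) and the toy patient records are hypothetical, and a naive greedy agglomerative modularity maximization with a `resolution` parameter stands in for whichever modularity detection algorithm was actually used.

```python
from collections import defaultdict
from itertools import combinations


def build_comorbidity_edges(patient_codes):
    """Edge weight = number of patients whose record contains both codes."""
    weights = defaultdict(int)
    for codes in patient_codes.values():
        for a, b in combinations(sorted(set(codes)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def greedy_modules(weights, resolution=1.0):
    """Naive greedy modularity maximization (an assumed stand-in algorithm).

    Starts with one module per code and repeatedly merges the pair of
    modules with the largest positive modularity gain. Returns a dict
    mapping each code to the label of its module.
    """
    m = sum(weights.values())  # total edge weight
    module = {code: code for edge in weights for code in edge}
    while True:
        degree = defaultdict(float)   # summed weighted degree per module
        between = defaultdict(float)  # edge weight between module pairs
        for (a, b), w in weights.items():
            ma, mb = module[a], module[b]
            degree[ma] += w
            degree[mb] += w
            if ma != mb:
                between[tuple(sorted((ma, mb)))] += w
        best_gain, best_pair = 0.0, None
        for (ma, mb), w in sorted(between.items()):
            # Modularity gain of merging modules ma and mb.
            gain = w / m - resolution * degree[ma] * degree[mb] / (2 * m * m)
            if gain > best_gain + 1e-12:
                best_gain, best_pair = gain, (ma, mb)
        if best_pair is None:
            return module
        keep, drop = best_pair
        for code, mod in module.items():
            if mod == drop:
                module[code] = keep


def encode_by_module(patient_codes, module):
    """One binary column per module: 1 if the patient has any code in it.

    Codes that never co-occur with another code have no module and are
    skipped in this sketch.
    """
    columns = sorted(set(module.values()))
    index = {mod: i for i, mod in enumerate(columns)}
    rows = {}
    for pid, codes in patient_codes.items():
        row = [0] * len(columns)
        for code in codes:
            if code in module:
                row[index[module[code]]] = 1
        rows[pid] = row
    return columns, rows


# Toy records (hypothetical): patient id -> ICD-9 codes without the dot.
patient_codes = {
    "p1": ["4280", "42731", "4019"],
    "p2": ["4280", "42731", "4019"],
    "p3": ["25000", "5849", "2724"],
    "p4": ["25000", "5849", "2724"],
    "p5": ["4019", "25000"],
}

weights = build_comorbidity_edges(patient_codes)
module = greedy_modules(weights, resolution=1.0)
columns, rows = encode_by_module(patient_codes, module)
```

In this toy example the six codes collapse into two module columns, compared with six columns under binary encoding of the ungrouped codes. Raising `resolution` penalizes merges more strongly and so tends to yield more, smaller modules, which is one way several sets of modules could be generated from the same network.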
Models that used modularity encoding outperformed models that used binary encoding of ungrouped codes. Accuracy improved across all algorithms, ranging from 0.736 to 0.78 for modularity encoding compared with 0.727 to 0.779 for binary encoding. AUC, recall, and precision also improved across almost all algorithms. In comparison with other grouping approaches, modularity encoding generally showed slightly higher performance in AUC, ranging from 0.813 to 0.837, and precision, ranging from 0.752 to 0.782.
Modularity encoding enhances the performance of ML models in pharmacy research by effectively reducing dimensionality while retaining necessary information. Across the three algorithms used, models utilizing modularity encoding showed superior or comparable performance to other encoding approaches. Modularity encoding offers further advantages: it can be used for both hierarchical and non-hierarchical HCSs, it is clinically relevant, and it can enhance the clinical interpretation of ML models. A Python package has been developed to facilitate the use of the approach in future research.