Kim Minkyoung, Kim Yunha, Kang Hee Jun, Seo Hyeram, Choi Heejung, Han JiYe, Kee Gaeun, Ko Soyoung, Jung HyoJe, Kim Byeolhee, Choi Boeun, Jun Tae Joon, Kim Young-Hak
Department of Information Medicine, Asan Medical Center, 88, Olympic-ro 43-gil, Songpa-gu, Seoul, 05505, Republic of Korea.
Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Olympic-ro 43-gil, Songpa-gu, Seoul, 05505, Republic of Korea.
BMC Med Inform Decis Mak. 2025 Aug 11;25(1):300. doi: 10.1186/s12911-025-03145-x.
The integration of electronic medical records (EMRs) with artificial intelligence (AI) is enhancing medical research, particularly in real-world evidence (RWE) studies. Extracting insights from coded medical data, such as ICD-10 codes, is essential for patient characterization. Traditional techniques such as one-hot encoding (OHE) have limitations, particularly in handling high-dimensional data. This study introduces a Bidirectional Encoder Representations from Transformers (BERT) approach to encoding ICD-10 diagnostic codes that improves model performance while reducing dimensionality. Data from 495,269 patients who visited the Cardiology Department at Asan Medical Center between 2000 and 2020 were used to compare the performance of models trained with OHE and with ClinicalBERT embeddings. For predicting major adverse cardiovascular events within one year following percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG), the ClinicalBERT (code-embedded) model outperformed the OHE model, achieving an area under the curve (AUC) of 0.746 versus 0.719 while reducing the dimensionality from 2,492 to 128. This method, which integrates diagnostic and medication data, provides valuable insights into patient care, improving the precision of predictions and supporting healthcare professionals in making more informed decisions.
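The contrast the abstract draws between OHE and dense code embeddings can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy vocabulary, the function names, the random stand-in vectors (used in place of real ClinicalBERT embeddings), and the mean-pooling step are all assumptions made for illustration. Only the embedding width of 128 is taken from the abstract.

```python
import random


def one_hot_encode(patient_codes, vocabulary):
    """Multi-hot vector with one slot per distinct code in the vocabulary."""
    index = {code: i for i, code in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for code in patient_codes:
        vector[index[code]] = 1
    return vector


def make_stand_in_embeddings(vocabulary, dim=128, seed=0):
    """Random dense vectors standing in for pretrained code embeddings.

    Assumption: a real pipeline would obtain these from a language model
    such as ClinicalBERT; random values are used here only to show shapes.
    """
    rng = random.Random(seed)
    return {code: [rng.gauss(0.0, 1.0) for _ in range(dim)] for code in vocabulary}


def embed_patient(patient_codes, embeddings):
    """Mean-pool the code vectors so every patient gets a fixed-width vector.

    Assumption: mean pooling is one common aggregation choice, not
    necessarily the one used in the study.
    """
    dim = len(next(iter(embeddings.values())))
    if not patient_codes:
        return [0.0] * dim
    pooled = [0.0] * dim
    for code in patient_codes:
        for j, value in enumerate(embeddings[code]):
            pooled[j] += value
    return [value / len(patient_codes) for value in pooled]


# Hypothetical toy vocabulary of ICD-10 codes; the study reports 2,492 OHE dimensions.
vocabulary = ["I10", "I20.0", "I21.4", "I25.1", "I50.9", "E11.9", "E78.5", "N18.3"]
patient = ["I25.1", "E11.9", "I10"]

ohe_vector = one_hot_encode(patient, vocabulary)
embeddings = make_stand_in_embeddings(vocabulary, dim=128)
dense_vector = embed_patient(patient, embeddings)

print(len(ohe_vector))    # width equals the vocabulary size
print(len(dense_vector))  # width is fixed by the embedding size
```

In this sketch the OHE width grows with every new code added to the vocabulary, whereas the pooled embedding stays at 128 values per patient regardless of how many distinct codes exist.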