Lam Barbara D, Ma Shengling, Kovalenko Iuliia, Wang Peiqi, Jafari Omid, Li Ang, Horng Steven
Division of Hematology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA.
Division of Clinical Informatics, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA.
Res Pract Thromb Haemost. 2025 May 21;9(4):102896. doi: 10.1016/j.rpth.2025.102896. eCollection 2025 May.
Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality. Advances in diagnosis, risk stratification, and prevention can improve outcomes. Large, publicly available datasets are needed to move research forward, but are lacking in the field of hemostasis and thrombosis.
In this study, we tested whether a machine learning language model could automatically add PE labels to a large dataset.
We extracted all computed tomography pulmonary angiography radiology reports (n = 19,942) from the Medical Information Mart for Intensive Care IV, a database of adult patients who presented to the emergency room or were admitted to the intensive care unit at one tertiary care center between 2008 and 2019. Two physicians manually labeled each report result as PE positive (acute PE) or PE negative. Using these labels as our gold standard, we compared a fine-tuned Bio_ClinicalBERT (bidirectional encoder representations from transformers) language model, known as venous thromboembolism-BERT, with diagnosis codes in their ability to classify reports as PE positive or negative.
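The diagnosis-code comparator can be illustrated with a minimal sketch. The abstract does not say which codes were used, so the code families below are an assumption: ICD-10 `I26` and ICD-9 `415.1` are the standard pulmonary embolism categories, but the authors' actual list may differ. The function names and the `(code, icd_version)` pair layout are hypothetical, chosen for illustration only.

```python
from typing import Iterable, Tuple

# ASSUMPTION: these prefixes are the standard ICD pulmonary embolism
# categories, not a code list taken from the study itself.
ICD10_PE_PREFIX = "I26"   # ICD-10: pulmonary embolism
ICD9_PE_PREFIX = "4151"   # ICD-9: 415.1x, pulmonary embolism and infarction


def is_pe_code(code: str, icd_version: int) -> bool:
    """Return True if a single diagnosis code falls in a PE category."""
    # Normalize so that "I26.99" and "I2699" are treated the same way.
    normalized = code.replace(".", "").strip().upper()
    if icd_version == 10:
        return normalized.startswith(ICD10_PE_PREFIX)
    if icd_version == 9:
        return normalized.startswith(ICD9_PE_PREFIX)
    return False


def label_from_codes(codes: Iterable[Tuple[str, int]]) -> bool:
    """Label an encounter PE positive if any of its codes is a PE code."""
    return any(is_pe_code(code, version) for code, version in codes)
```

Under these assumptions, `label_from_codes([("I26.99", 10), ("J189", 10)])` marks the encounter as PE positive, while an encounter carrying no code from either family is marked PE negative.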
Venous thromboembolism-BERT had a sensitivity of 92.4% and a positive predictive value of 87.8% in all 19,942 computed tomography pulmonary angiography reports. Diagnosis codes had a sensitivity of 95.4% and a positive predictive value of 83.8% in the subset of 11,990 reports with an associated discharge diagnosis code.
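For readers less familiar with the two reported metrics, the following sketch shows how sensitivity and positive predictive value are computed against gold-standard labels. The function name and the toy labels are hypothetical and are not data from the study.

```python
from typing import Sequence, Tuple


def sensitivity_and_ppv(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Compute sensitivity and positive predictive value.

    sensitivity = TP / (TP + FN): share of true PE reports that were flagged.
    PPV         = TP / (TP + FP): share of flagged reports that are true PE.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)

    # Return NaN when a denominator is zero so the caller can detect it.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy example (hypothetical labels): 3 true positives, 1 miss, 1 false alarm.
gold_labels = [True, True, True, False, False, False, True, False]
model_labels = [True, True, False, True, False, False, True, False]
sens, ppv = sensitivity_and_ppv(gold_labels, model_labels)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")  # sensitivity=0.75, PPV=0.75
```

Note that the two classifiers in the results above were evaluated on different denominators (all 19,942 reports versus the 11,990 reports with a discharge diagnosis code), so their figures are not a like-for-like comparison.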
We successfully added nearly 20,000 PE labels to the publicly available Medical Information Mart for Intensive Care IV database and demonstrated how a transformer language model can automate and accelerate hematologic research.