Jafari Omid, Ma Shengling, Lam Barbara D, Jiang Jun Y, Zhou Emily, Ranjan Mrinal, Ryu Justine, Bandyo Raka, Maghsoudi Arash, Peng Bo, Amos Christopher I, Oluyomi Abiodun, Fillmore Nathanael R, La Jennifer, Li Ang
Section of Hematology-Oncology, Baylor College of Medicine, Houston, TX.
Division of Hematology & Oncology, Fred Hutch Cancer Center, University of Washington.
J Thromb Haemost. 2025 Aug 1. doi: 10.1016/j.jtha.2025.07.021.
Accurate and rapid phenotyping of venous thromboembolism (VTE) in longitudinal studies is important. A natural language processing (NLP) tool externally validated in representative patients is lacking.
We designed a novel NLP platform, NLPMed, to assist thrombosis researchers with data preprocessing, phenotype annotation, language model fine-tuning, and NLP application. Using clinical notes, discharge summaries, and radiology reports from patients with cancer at two healthcare institutions, we fine-tuned Bio_ClinicalBERT to develop VTE-BERT. The new model was trained to detect acute VTE events and their anatomical locations longitudinally. We validated the model's performance internally and externally in two randomly sampled cohorts of patients with advanced cancer.
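The cohorts in this study are restricted to notes containing at least one VTE keyword. A minimal sketch of such a keyword pre-filter follows. The keyword list, the `has_vte_keyword` and `filter_notes` helpers, and the note dictionary layout are all hypothetical; the abstract does not specify the keywords or the preprocessing code that NLPMed actually uses.

```python
import re

# Hypothetical keyword list: the abstract does not enumerate the actual VTE keywords.
VTE_KEYWORDS = [
    "venous thromboembolism",
    "vte",
    "deep vein thrombosis",
    "dvt",
    "pulmonary embolism",
    "thrombus",
]

# Word boundaries stop short abbreviations from matching inside longer words.
VTE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in VTE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def has_vte_keyword(note_text: str) -> bool:
    """Return True if the note mentions at least one keyword."""
    return VTE_PATTERN.search(note_text) is not None


def filter_notes(notes):
    """Keep only notes (dicts with a 'text' field) that contain >= 1 keyword."""
    return [note for note in notes if has_vte_keyword(note["text"])]


# Illustrative, made-up notes (not study data).
sample_notes = [
    {"patient_id": "A", "text": "CT angiogram negative for pulmonary embolism."},
    {"patient_id": "A", "text": "Patient tolerated chemotherapy well."},
    {"patient_id": "B", "text": "New DVT in the left femoral vein."},
]
kept_notes = filter_notes(sample_notes)
```

A filter like this only narrows the set of notes to review; negated mentions such as "negative for pulmonary embolism" still pass, which is why a downstream classifier is needed to decide whether an acute event is actually documented.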
The training cohort consisted of 715 patients and 14,013 annotated notes with ≥1 VTE keyword from the Harris Health System (HHS). The internal validation cohort included 400 additional patients with 7,190 VTE keyword-containing notes from HHS. The external validation cohort included 400 patients with 7,371 VTE keyword-containing notes from the National Veterans Affairs Healthcare System. VTE-BERT was trained until it reached a precision of 95% and a recall of 98% at the patient level. On independent datasets, the model achieved a precision of 95% and a recall of 91% in internal validation, and a precision of 85% and a recall of 92% in external validation.
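Precision and recall are reported here at the patient level, while the model reads individual notes. The sketch below shows one way to compute patient-level metrics from note-level labels. It assumes a patient counts as positive when any of their notes is labeled positive; that aggregation rule, the function names, and the toy data are illustrative assumptions, not the authors' documented procedure.

```python
from collections import defaultdict


def patient_level_labels(note_rows):
    """Collapse (patient_id, note_label) pairs into one label per patient.

    Assumption: a patient is positive if ANY of their notes is positive.
    """
    labels = defaultdict(bool)
    for patient_id, note_label in note_rows:
        labels[patient_id] = labels[patient_id] or bool(note_label)
    return dict(labels)


def precision_recall(truth, pred):
    """Compute precision and recall over patients present in either mapping."""
    patients = set(truth) | set(pred)
    tp = sum(1 for p in patients if truth.get(p, False) and pred.get(p, False))
    fp = sum(1 for p in patients if not truth.get(p, False) and pred.get(p, False))
    fn = sum(1 for p in patients if truth.get(p, False) and not pred.get(p, False))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy data (not study data): annotated labels vs. model predictions per note.
annotated = [("A", 1), ("A", 0), ("B", 0), ("C", 1), ("D", 0)]
predicted = [("A", 0), ("A", 1), ("B", 1), ("C", 0), ("D", 0)]

truth = patient_level_labels(annotated)
pred = patient_level_labels(predicted)
precision, recall = precision_recall(truth, pred)
```

In the toy data, patient A is a true positive even though the model flags a different note than the annotator did, which illustrates why patient-level and note-level metrics can differ.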
We trained and externally validated an efficient NLP model to detect incident VTE events longitudinally. We believe its adoption will accelerate thrombosis research by improving VTE detection at scale and reducing the time and expense of manual chart review in big-data epidemiological studies.