A series of natural language processing for predicting tumor response evaluation and survival curve from electronic health records.
Author information
Takeuchi Toshiki, Horinouchi Hidehito, Takasawa Ken, Mukai Masami, Masuda Ken, Shinno Yuki, Okuma Yusuke, Yoshida Tatsuya, Goto Yasushi, Yamamoto Noboru, Ohe Yuichiro, Miyake Mototaka, Watanabe Hirokazu, Kusumoto Masahiko, Aoki Takashi, Nishimura Kunihiro, Hamamoto Ryuji
Affiliations
Xcoo, Inc. Hongo-Sanchome TH Bldg., 6F, 2-40-8, Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan.
Department of Thoracic Oncology, National Cancer Center Hospital, 5-1-1 Tsukiji, Chuo-Ku, Tokyo, 104-0045, Japan.
Publication information
BMC Med Inform Decis Mak. 2025 Feb 17;25(1):85. doi: 10.1186/s12911-025-02928-6.
BACKGROUND
The clinical information housed within unstructured electronic health records (EHRs) has the potential to promote cancer research. The National Cancer Center Hospital (NCCH) is widely recognized as a leading institution for the treatment of thoracic malignancies in Japan. Clinical information, particularly tumor characteristics, tumor response evaluations, and adverse events, has been compiled from EHRs into the databases of each NCCH department. However, there have been few opportunities for integrated analysis of data from both the hospital and the research institute.
METHODS
We developed a method that uses natural language processing to predict tumor response evaluation and survival curves for drug therapy from the EHRs of lung cancer patients. First, we developed a rule-based algorithm that predicts treatment duration using a dictionary of anticancer drugs and regimens used for lung cancer treatment. We then applied supervised learning to the radiology reports from each treatment period and constructed a classification model that predicts the tumor response evaluation of anticancer drugs and the date on which progressive disease (PD) was determined. The predicted response and PD date can be used to draw a progression-free survival curve.
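As an illustration of the last step only, the following is a minimal sketch of how predicted PD dates could be turned into a progression-free survival curve. It assumes the standard Kaplan-Meier product-limit estimator, which the abstract does not name, and it treats only PD as an event. The function names `pfs_observation` and `kaplan_meier` and the example values are hypothetical and are not taken from the paper.

```python
from datetime import date


def pfs_observation(start, pd_date, last_followup):
    """Convert one treatment into (days, event_observed).

    Assumption: a treatment with no predicted PD date is censored
    at its last follow-up date.
    """
    if pd_date is not None:
        return (pd_date - start).days, True
    return (last_followup - start).days, False


def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimate.

    Returns a list of (time, survival_probability) pairs, one for each
    distinct time at which at least one event (PD) was observed.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        n_events = 0
        n_at_t = 0
        # Group all observations that share the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            n_events += 1 if pairs[i][1] else 0
            n_at_t += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set after t.
        at_risk -= n_at_t
    return curve


if __name__ == "__main__":
    # Made-up treatments: (start, predicted PD date or None, last follow-up).
    treatments = [
        (date(2020, 1, 1), date(2020, 3, 1), date(2020, 6, 1)),
        (date(2020, 1, 1), None, date(2020, 4, 1)),
        (date(2020, 2, 1), date(2020, 7, 1), date(2020, 9, 1)),
    ]
    observations = [pfs_observation(*t) for t in treatments]
    durations = [d for d, _ in observations]
    events = [e for _, e in observations]
    for t, s in kaplan_meier(durations, events):
        print(f"day {t}: estimated PFS probability {s:.3f}")
```

Running the same function on the manually curated PD dates would give a reference curve to set against the predicted one.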
RESULTS
We used the EHRs of 716 lung cancer treatments at the NCCH, with structured data for the same cases serving as labels for training and testing the supervised learning models. The structured data were manually curated by physicians and clinical research coordinators (CRCs). We evaluated the results and performance of the proposed method. Performance on individual predictions of tumor response evaluation and PD date was not particularly high. However, the final predicted survival curves closely resembled the actual survival curves.
CONCLUSIONS
Although it is difficult to construct a fully automated system using our method, we believe that it performs well enough to support physicians and CRCs in constructing the database and to provide clinical information that helps researchers identify opportunities for clinical studies.