IEEE J Biomed Health Inform. 2023 Dec;27(12):6062-6073. doi: 10.1109/JBHI.2023.3324191. Epub 2023 Dec 5.
Electronic claims records (ECRs) are large-scale, longitudinal collections of individuals' medical service-seeking actions. Compared to in-hospital electronic medical records (EMRs), ECRs are more standardized and span multiple sites. Recent studies have shown promising results in modeling claims data for a wide range of medical applications. However, few of them address exclusion criteria in cohort selection for extracting new incidence without prior signs, and they often lack emphasis on predicting cancer in its early stages. In this work, we aim to design a lung cancer prediction framework using ECRs with a rigorous exclusion design and a state-of-the-art sequence-based transformer. Furthermore, this work presents one of the first results from applying a disease prediction model to the entire population of Taiwan. The results show over 2.1 predictive power, 5 average positive predictive value (PPV), and 0.668 area under the curve (AUC) for all-stage lung cancer, and around 2.0 predictive power, 1 average PPV, and 0.645 AUC for early-stage lung cancer in our dataset. Sub-cohort analysis can funnel a high-precision selected group into prioritized clinical examination. Onset analysis validates the effect of our exclusion criteria. This work presents comprehensive analyses of lung cancer prediction, and the proposed approach can serve as a state-of-the-art disease risk prediction framework on claims data.
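For readers less familiar with the two standard metrics named in the abstract, the following is a minimal sketch of how PPV and AUC are conventionally computed from binary labels and risk scores. It is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the abstract's "predictive power" measure is not defined in the text, so it is not reproduced here.

```python
# Illustrative only: textbook definitions of PPV and AUC on made-up data.
# Nothing here comes from the paper's dataset or code.

def ppv(y_true, y_pred):
    """Positive predictive value: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        raise ValueError("no positive predictions, PPV is undefined")
    return tp / (tp + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    counting ties as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy cohort: 1 = developed the disease, 0 = did not.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

# Assumed threshold of 0.5 to turn risk scores into binary predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_ppv = ppv(y_true, y_pred)
toy_auc = auc(y_true, scores)
print(f"PPV = {toy_ppv:.3f}, AUC = {toy_auc:.3f}")
```

The pairwise AUC is quadratic in cohort size and is used here only because it makes the definition explicit; on population-scale data a rank-based implementation would be the practical choice.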