Jian Xinyao, Zhang Dazheng, Yu Zehao, Xu Hua, Bian Jiang, Wu Yonghui, Tong Jiayi, Chen Yong
The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA.
Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
J Biomed Inform. 2025 Jun;166:104839. doi: 10.1016/j.jbi.2025.104839. Epub 2025 Apr 30.
In electronic health record (EHR)-based association studies, phenotyping algorithms efficiently classify patient clinical outcomes into binary categories but are susceptible to misclassification errors. The gold standard, manual chart review, involves clinicians determining the true disease status based on their assessment of health records. These clinician-labeled phenotypes are labor-intensive to obtain and typically limited to a small subset of patients, and they can introduce a third "undecided" category when a phenotype is indeterminate. We aim to effectively integrate the algorithm-derived and chart-reviewed outcomes when both are available in EHR-based association studies.
We propose an augmented estimation method that combines the binary algorithm-derived phenotypes for the entire cohort with the trinary chart-reviewed phenotypes for a small, selected subset. Additionally, a cost-effective outcome-dependent sampling strategy is used to address rare-disease scenarios. The proposed trinary chart-reviewed phenotype integrated cost-effective augmented estimation (TriCA) was evaluated across a wide range of simulation settings and in real-world applications, including EHR data on Alzheimer's disease and related dementias (ADRD) from the OneFlorida+ Clinical Research Network and cohort data on second breast cancer events (SBCE) from Kaiser Permanente Washington.
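As a rough illustration of what an outcome-dependent sampling step can look like, the sketch below selects patients for chart review by stratifying on the binary algorithm-derived phenotype and oversampling the rarer algorithm-positive stratum. This is not the TriCA procedure itself: the function name `sample_for_chart_review`, the per-stratum sample sizes, and the use of inverse-probability weights are all assumptions made for illustration.

```python
import random


def sample_for_chart_review(surrogate, n_pos, n_neg, seed=0):
    """Select a chart-review subset stratified by the algorithm-derived label.

    surrogate: list of 0/1 algorithm-derived phenotypes for the whole cohort.
    n_pos, n_neg: hypothetical review budgets for each stratum.
    Returns {patient_index: inverse-probability weight} for sampled patients.
    """
    rng = random.Random(seed)
    pos = [i for i, s in enumerate(surrogate) if s == 1]
    neg = [i for i, s in enumerate(surrogate) if s == 0]

    # Never request more patients than a stratum contains.
    k_pos = min(n_pos, len(pos))
    k_neg = min(n_neg, len(neg))

    weights = {}
    # Each sampled patient stands in for (stratum size / sample size) patients.
    for i in rng.sample(pos, k_pos):
        weights[i] = len(pos) / k_pos
    for i in rng.sample(neg, k_neg):
        weights[i] = len(neg) / k_neg
    return weights


# Toy cohort: 2% algorithm-positive, mimicking a rare outcome.
cohort = [1] * 20 + [0] * 980
review_weights = sample_for_chart_review(cohort, n_pos=10, n_neg=50)
```

Here half of the algorithm-positive patients are sent to chart review but only about 5% of the algorithm-negative ones, so the reviewed subset is enriched for likely cases while the weights keep track of how many cohort members each reviewed patient represents.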
Compared to estimation based on random sampling, our augmented method reduced mean squared error by up to 28.3% in simulation studies; compared to estimation using only trinary chart-reviewed phenotypes, our method improved efficiency by up to 33.3% in ADRD data and 50.8% in SBCE data.
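For readers unfamiliar with how such percentages are typically computed, the sketch below shows one common convention: mean squared error taken over repeated simulation estimates, and improvement reported as the relative reduction against a baseline. Whether the study uses exactly this definition is an assumption; the helper names and the toy numbers are hypothetical and are not results from the paper.

```python
def mean_squared_error(estimates, truth):
    """Average squared deviation of simulation estimates from the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


def relative_reduction(baseline, proposed):
    """Fractional reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


# Toy estimates of a true effect of 1.0 (illustrative values only).
baseline_mse = mean_squared_error([0.6, 1.4, 0.8, 1.2], truth=1.0)
proposed_mse = mean_squared_error([0.7, 1.3, 0.9, 1.1], truth=1.0)
reduction = relative_reduction(baseline_mse, proposed_mse)
```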
Our simulation studies and real-world applications demonstrate that, compared to existing methods, the proposed method provides unbiased estimates with higher statistical efficiency.
The proposed method effectively combines binary algorithm-derived phenotypes for the whole cohort with trinary chart-reviewed outcomes for a limited validation set, making it applicable to a broader range of settings and enhancing risk factor identification in EHR-based association studies.