Zhang Yichi, Liu Molei, Neykov Matey, Cai Tianxi
Department of Computer Science and Statistics, University of Rhode Island.
Department of Biostatistics, Harvard T.H. Chan School of Public Health.
J Mach Learn Res. 2022;23.
Electronic health record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite this potential, EHR data are currently underutilized for discovery research because they lack precise phenotype information. To overcome this difficulty, recent efforts have been devoted to developing supervised algorithms that accurately predict phenotypes from relatively small training datasets with gold-standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly labeled dataset in which the feature set X is accompanied only by a surrogate label Z that is available for all patients. Under a prior assumption that Z is related to X only through Y, and allowing it to hold only approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and its robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.
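The idea of shrinking a small-sample estimator towards a direction learned from surrogate labels can be illustrated with a minimal sketch. This is not the authors' PASS estimator: the simulated data, the least-squares step used to obtain the prior direction, the ridge-type penalty on the part of the coefficient vector that is orthogonal to that direction, and every function and variable name below are assumptions made only for illustration. The simulation generates a surrogate that depends on the features only through the true label, which is the prior assumption described in the abstract.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def prior_direction(X, Z):
    """Unit-norm slope from a least-squares fit of the surrogate Z on X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, Z, rcond=None)
    alpha = coef[1:]
    return alpha / np.linalg.norm(alpha)

def fit_shrunk_logistic(X, y, direction, lam, n_iter=3000):
    """Logistic regression by gradient descent with the penalty
    (lam / 2) * ||beta - (direction . beta) * direction||^2,
    i.e. only the component of beta orthogonal to the prior direction
    is shrunk. Setting lam = 0 gives plain logistic regression."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    # Upper bound on the curvature of the objective, used as 1 / step.
    lipschitz = 0.25 * np.linalg.eigvalsh(A.T @ A / n)[-1] + lam
    step = 1.0 / lipschitz
    b0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        resid = sigmoid(b0 + X @ beta) - y
        off_direction = beta - (direction @ beta) * direction
        b0 -= step * resid.mean()
        beta -= step * (X.T @ resid / n + lam * off_direction)
    return b0, beta

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical simulation settings (not taken from the paper).
rng = np.random.default_rng(0)
p, n_labeled, n_unlabeled = 20, 100, 20000
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -1.0, 0.5]

def simulate(n):
    X = rng.normal(size=(n, p))
    Y = rng.binomial(1, sigmoid(X @ beta_true))
    Z = Y + 0.5 * rng.normal(size=n)  # surrogate depends on X only via Y
    return X, Y, Z

X_unl, _, Z_unl = simulate(n_unlabeled)  # true labels discarded here
X_lab, Y_lab, _ = simulate(n_labeled)    # small chart-reviewed set

direction = prior_direction(X_unl, Z_unl)
_, beta_pass = fit_shrunk_logistic(X_lab, Y_lab, direction, lam=5.0)
_, beta_sup = fit_shrunk_logistic(X_lab, Y_lab, direction, lam=0.0)

cos_prior = cosine(direction, beta_true)
cos_pass = cosine(beta_pass, beta_true)
cos_sup = cosine(beta_sup, beta_true)
print("cosine with true coefficients:")
print("  prior direction:", round(cos_prior, 3))
print("  shrunk estimator:", round(cos_pass, 3))
print("  supervised only:", round(cos_sup, 3))
```

In this sketch the penalty weight lam is fixed by hand. When the prior assumption holds only roughly, a smaller lam lets the labeled data pull the estimate away from the prior direction, which is the kind of adaptivity the abstract refers to; a data-driven choice of lam (for example by cross-validation) would be needed in practice.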