Chiu Po-Hsiang, Hripcsak George
Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA.
J Biomed Inform. 2017 Jun;70:35-51. doi: 10.1016/j.jbi.2017.04.009. Epub 2017 Apr 12.
In data-driven phenotyping, a core computational task is to identify medical concepts and their variations in electronic health record (EHR) sources in order to stratify phenotypic cohorts. Conventional analytic frameworks for phenotyping rely largely on manual knowledge engineering or on supervised learning, in which clinical cases are represented by variables such as diagnoses, medications, and laboratory tests. In such frameworks, feature engineering and data annotation remain tedious and expensive, which results in poor scalability. In addition, certain clinical conditions, such as those that are rare and acute in nature, may never accumulate sufficient data over time, which makes it difficult to establish accurate and informative statistical models. In this paper, we use infectious diseases as the domain of study to demonstrate a hierarchical learning method, based on ensemble learning, that attempts to address these issues through feature abstraction. We use a sparse annotation set to train and evaluate many phenotypes at once, an approach we call bulk learning. In this batch-phenotyping framework, disease cohort definitions can be learned from within an abstract feature space that is established by using multiple diseases as a substrate and diagnostic codes as surrogate labels. In particular, because the models are trained on surrogate labels, they can subsequently be evaluated using only a sparse annotated sample. Moreover, statistical models can be trained and evaluated, using the same sparse annotation, from within a low-dimensional abstract feature space that encapsulates the shared clinical traits of the target diseases; these diseases are collectively referred to as the bulk learning set.
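The two-level idea described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a stacked-generalization layout (one base model per disease, trained on noisy diagnostic codes as surrogate labels, whose predicted probabilities form the abstract feature space), it uses a hand-written logistic regression, and the cohort is entirely synthetic. All names and sizes (`N_CASES`, `noisy_code`, `train_logistic`, the 40-case annotated sample) are hypothetical.

```python
import math
import random

random.seed(0)
N_CASES, N_RAW, N_DISEASES = 300, 6, 3  # hypothetical sizes for the synthetic cohort

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic cohort: each case has one true disease that shifts two raw features.
true_disease = [random.randrange(N_DISEASES) for _ in range(N_CASES)]
raw = [[random.gauss(1.5 if j // 2 == d else 0.0, 1.0) for j in range(N_RAW)]
       for d in true_disease]

def noisy_code(is_case):
    # Assumption: a diagnostic code agrees with the true status 90% of the time.
    return int(random.random() < (0.9 if is_case else 0.1))

codes = [[noisy_code(d == k) for k in range(N_DISEASES)] for d in true_disease]

# Level 1: one base model per disease, trained on surrogate labels (codes) only.
base_models = [train_logistic(raw, [c[k] for c in codes]) for k in range(N_DISEASES)]

# Abstract feature space: one predicted probability per disease in the bulk set.
abstract = [[predict_proba(m, x) for m in base_models] for x in raw]

# Level 2: a sparse annotated sample is split for training and evaluation.
annotated = random.sample(range(N_CASES), 40)
train_ids, eval_ids = annotated[:20], annotated[20:]
target = 0  # phenotype to model at level 2
meta = train_logistic([abstract[i] for i in train_ids],
                      [int(true_disease[i] == target) for i in train_ids])
hits = sum((predict_proba(meta, abstract[i]) >= 0.5) == (true_disease[i] == target)
           for i in eval_ids)
accuracy = hits / len(eval_ids)
print(f"level-2 accuracy on {len(eval_ids)} annotated cases: {accuracy:.2f}")
```

The level-2 model sees only three inputs per case instead of the six raw variables, which is what lets a 20-case annotated sample carry the training. A more careful version would build the abstract features from out-of-fold predictions, so that no case is scored by a base model that was trained on it.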