Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.
Nat Protoc. 2019 Dec;14(12):3426-3444. doi: 10.1038/s41596-019-0227-6. Epub 2019 Nov 20.
Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 days if all data are available; however, the overall timing depends largely on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP are a phenotype algorithm, the probability of the phenotype for every patient, and a binary phenotype classification (yes or no).
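To make the three final products concrete, the following is a minimal illustrative sketch in Python, not the PheCAP implementation itself. It assumes two hypothetical features per patient (a count of diagnosis codes and a count of NLP mentions), a log(1 + x) transform, a small hand-made set of chart-review labels, and a plain L2-penalized logistic regression fitted by gradient descent. All names, data values, and the 0.5 threshold are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_features(counts):
    """Apply log(1 + x) to raw counts (an assumed transform for this sketch)."""
    return [math.log1p(c) for c in counts]


def fit_logistic(X, y, l2=0.01, lr=0.1, epochs=2000):
    """Fit L2-penalized logistic regression by full-batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n  # intercept is not penalized
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(X, w, b):
    """Return the estimated phenotype probability for each patient."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def classify(probs, threshold=0.5):
    """Turn probabilities into yes (1) / no (0) calls at an assumed threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical chart-reviewed patients: (diagnosis code count, NLP mention count).
labeled_counts = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1),
                  (5, 4), (8, 6), (10, 3), (4, 7), (12, 9)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hypothetical patients without chart review, to be scored by the algorithm.
unlabeled_counts = [(0, 0), (1, 2), (3, 2), (6, 5), (15, 12)]

# Product 1: the phenotype algorithm (here, fitted coefficients).
weights, intercept = fit_logistic([log_features(c) for c in labeled_counts], labels)

# Product 2: the phenotype probability for every patient.
probabilities = predict_proba([log_features(c) for c in unlabeled_counts],
                              weights, intercept)

# Product 3: the yes/no phenotype classification.
classifications = classify(probabilities)

for counts, prob, call in zip(unlabeled_counts, probabilities, classifications):
    print(counts, round(prob, 3), "yes" if call else "no")
```

In practice the threshold would be chosen against the gold standard labels (for example, to reach a target specificity) rather than fixed at 0.5, and the feature set would be far larger than the two counts assumed here.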