Semi-supervised learning of the electronic health record for phenotype stratification.

Authors

Beaulieu-Jones Brett K, Greene Casey S

Affiliations

Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, United States; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, United States.

Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, United States; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, United States; Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Perelman School of Medicine, University of Pennsylvania, United States.

Publication information

J Biomed Inform. 2016 Dec;64:168-178. doi: 10.1016/j.jbi.2016.10.007. Epub 2016 Oct 12.

Abstract

Patient interactions with health care providers result in entries to electronic health records (EHRs). EHRs were built for clinical and billing purposes but contain many data points about an individual. Mining these records provides opportunities to extract electronic phenotypes, which can be paired with genetic data to identify genes underlying common human diseases. This task remains challenging: high quality phenotyping is costly and requires physician review; many fields in the records are sparsely filled; and our definitions of diseases are continuing to improve over time. Here we develop and evaluate a semi-supervised learning method for EHR phenotype extraction using denoising autoencoders for phenotype stratification. By combining denoising autoencoders with random forests we find classification improvements across multiple simulation models and improved survival prediction in ALS clinical trial data. This is particularly evident in cases where only a small number of patients have high quality phenotypes, a common scenario in EHR-based research. Denoising autoencoders perform dimensionality reduction enabling visualization and clustering for the discovery of new subtypes of disease. This method represents a promising approach to clarify disease subtypes and improve genotype-phenotype association studies that leverage EHRs.
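The abstract does not give implementation details, so the following is only a minimal sketch of what a denoising autoencoder does, not the authors' code. It assumes a single hidden layer with tied weights, sigmoid units, masking noise, and a squared-error loss; the toy cohort, the class name `DenoisingAutoencoder`, the helper `simulate_patients`, and all hyperparameters are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def corrupt(x, rate, rng):
    """Masking noise: set each feature to 0 with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in x]


class DenoisingAutoencoder:
    """Illustrative model: one hidden layer, tied weights, sigmoid units."""

    def __init__(self, n_in, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b_h = [0.0] * n_hidden
        self.b_o = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.W, self.b_h)]

    def decode(self, h):
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(len(h))) + b)
                for i, b in enumerate(self.b_o)]

    def reconstruction_error(self, x):
        y = self.decode(self.encode(x))
        return sum((yi - xi) ** 2 for yi, xi in zip(y, x))

    def train_step(self, x, rate, lr):
        x_noisy = corrupt(x, rate, self.rng)
        h = self.encode(x_noisy)
        y = self.decode(h)
        # The error is measured against the CLEAN input, not the noisy one.
        d_out = [(yi - xi) * yi * (1.0 - yi) for yi, xi in zip(y, x)]
        d_hid = [sum(self.W[j][i] * d_out[i] for i in range(len(x)))
                 * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for j in range(len(h)):
            for i in range(len(x)):
                # Tied weights: decoder and encoder gradients are summed.
                self.W[j][i] -= lr * (d_out[i] * h[j] + d_hid[j] * x_noisy[i])
            self.b_h[j] -= lr * d_hid[j]
        for i in range(len(x)):
            self.b_o[i] -= lr * d_out[i]


def simulate_patients(n, rng):
    """Hypothetical toy cohort: two hidden subtypes, six binary features."""
    patients = []
    for k in range(n):
        subtype = k % 2
        row = []
        for i in range(6):
            active = (i < 3) == (subtype == 0)
            p = 0.9 if active else 0.1
            row.append(1.0 if rng.random() < p else 0.0)
        patients.append(row)
    return patients


rng = random.Random(42)
data = simulate_patients(40, rng)
dae = DenoisingAutoencoder(n_in=6, n_hidden=2, seed=1)

before = sum(dae.reconstruction_error(x) for x in data) / len(data)
for epoch in range(200):
    for x in data:
        dae.train_step(x, rate=0.2, lr=0.5)
after = sum(dae.reconstruction_error(x) for x in data) / len(data)

# Low-dimensional representation; no labels were used to learn it.
features = [dae.encode(x) for x in data]
print("mean reconstruction error before:", round(before, 3))
print("mean reconstruction error after: ", round(after, 3))
```

In a semi-supervised setup along the lines the abstract describes, the autoencoder would be trained on all patients, labeled or not, and the two-dimensional `features` for the labeled subset could then be passed to a supervised classifier such as a random forest (for example scikit-learn's `RandomForestClassifier`, not shown here to keep the sketch dependency-free).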

