Agarwal Vibhu, Podchiyska Tanya, Banda Juan M, Goel Veena, Leung Tiffany I, Minty Evan P, Sweeney Timothy E, Gyang Elsie, Shah Nigam H
Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA
J Am Med Inform Assoc. 2016 Nov;23(6):1166-1173. doi: 10.1093/jamia/ocw028. Epub 2016 May 12.
Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.
We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.
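The two steps described above (keyword-based noisy labeling, then an L1 penalized logistic regression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword list, the toy notes, the three feature names, and the hyperparameters are all hypothetical, and the solver shown (proximal gradient descent with soft-thresholding) is one common way to fit an L1 penalized model, chosen here only because it needs no external libraries.

```python
import math
import re

# Hypothetical keyword list and toy records; none of this comes from the paper.
KEYWORDS = ["type 2 diabetes", "t2dm"]
RECORDS = [
    # (clinical note text, [metformin_rx, high_hba1c, statin_rx])
    ("History of type 2 diabetes, on metformin.", [1, 1, 0]),
    ("T2DM follow-up visit.", [1, 1, 1]),
    ("Seen for ankle sprain.", [0, 0, 0]),
    ("Annual physical, no complaints.", [0, 0, 1]),
    ("Type 2 diabetes, poorly controlled.", [1, 1, 0]),
    ("Seasonal allergies.", [0, 0, 1]),
    ("Presents with T2DM and hypertension.", [1, 1, 1]),
    ("Knee pain after a fall.", [0, 0, 0]),
]


def noisy_label(note, keywords):
    """Return 1 if any keyword occurs as a whole phrase in the note, else 0."""
    text = note.lower()
    return int(any(
        re.search(r"\b" + re.escape(k.lower()) + r"\b", text) for k in keywords
    ))


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.5, epochs=500):
    """Fit logistic regression with an L1 penalty on the weights.

    Each epoch takes one full-batch gradient step on the mean log loss and
    then soft-thresholds the weights. The intercept is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def predict(w, b, x):
    """Classify a feature vector with a 0.5 probability cutoff."""
    return int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5)


# Step 1: derive labels from keyword mentions instead of manual review.
labels = [noisy_label(note, KEYWORDS) for note, _ in RECORDS]
# Step 2: train on features that are separate from the labeling keywords.
X = [features for _, features in RECORDS]
w, b = fit_l1_logistic(X, labels)
```

On this toy data the penalty leaves the weight of the uninformative `statin_rx` feature at exactly zero, which is the sparsity property that motivates an L1 penalty when the feature space is large. Real keyword-derived labels would be noisier than in this example.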
Our models achieve a precision of 0.90 and an accuracy of 0.89 for Type 2 diabetes mellitus, and a precision of 0.86 and an accuracy of 0.89 for myocardial infarction. Local implementations of the previously validated rule-based definitions achieve a precision of 0.96 and an accuracy of 0.92 for Type 2 diabetes mellitus, and a precision of 0.84 and an accuracy of 0.87 for myocardial infarction.
We have demonstrated the feasibility of learning phenotype models from imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.
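For reference, precision and accuracy against a gold standard can be computed as in the short sketch below. The prediction and gold-standard vectors are made-up values for illustration only and are unrelated to the study data; the function name is likewise an assumption.

```python
def precision_and_accuracy(predicted, gold):
    """Compare binary predictions with gold-standard labels.

    precision = TP / (TP + FP); accuracy = correct / total.
    Precision is None when the model predicts no positives.
    """
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    precision = tp / (tp + fp) if (tp + fp) else None
    accuracy = correct / len(gold)
    return precision, accuracy


# Hypothetical labels for ten patients (not from the paper).
predicted = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
gold = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
precision, accuracy = precision_and_accuracy(predicted, gold)
```

Precision only counts errors among predicted positives, so a model can have high precision and still miss cases; accuracy counts both kinds of error across all patients.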
Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.