Learning statistical models of phenotypes using noisy labeled training data.

Author information

Agarwal Vibhu, Podchiyska Tanya, Banda Juan M, Goel Veena, Leung Tiffany I, Minty Evan P, Sweeney Timothy E, Gyang Elsie, Shah Nigam H

Affiliations

Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA

Publication information

J Am Med Inform Assoc. 2016 Nov;23(6):1166-1173. doi: 10.1093/jamia/ocw028. Epub 2016 May 12.

DOI:10.1093/jamia/ocw028

PMID:27174893

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5070523/

Abstract

OBJECTIVE

Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.

METHODS

We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.
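The two steps described above (keyword-based noisy labeling, then an L1 penalized logistic regression) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the notes, the keyword list, and the three binary features are hypothetical, and the solver (proximal gradient descent with soft-thresholding) is an assumption chosen to keep the example dependency-free, since the abstract does not say how the models were fit.

```python
import math
import re


def noisy_label(note, keywords):
    """Label a note 1 if any phenotype keyword occurs as a whole phrase, else 0.

    The label is noisy by construction: a negated mention would still count.
    """
    text = note.lower()
    return int(any(re.search(r"\b" + re.escape(k.lower()) + r"\b", text)
                   for k in keywords))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """Fit weights and an unpenalized bias by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        residuals = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
                     for row, yi in zip(X, y)]
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                  for j in range(d)]
        grad_b = sum(residuals) / n
        # Gradient step on the log loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Classify each row as 1 when the predicted probability is at least 0.5."""
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5)
            for row in X]


# Hypothetical keyword list and toy records (note text, feature vector).
# Features (all invented): [metformin ordered, elevated HbA1c, statin ordered].
keywords = ["type 2 diabetes mellitus", "T2DM"]
records = [
    ("Patient with type 2 diabetes mellitus, on metformin.", [1, 1, 0]),
    ("T2DM follow-up; HbA1c elevated.", [1, 1, 1]),
    ("Routine visit, no complaints.", [0, 0, 1]),
    ("Ankle sprain after fall.", [0, 0, 0]),
    ("History of type 2 diabetes mellitus.", [0, 1, 0]),
    ("Seasonal allergies.", [0, 0, 1]),
]

labels = [noisy_label(note, keywords) for note, _ in records]
features = [vector for _, vector in records]
weights, bias = fit_l1_logistic(features, labels)
preds = predict(features, weights, bias)
```

In this toy setup the L1 penalty pushes weights of uninformative features toward zero, which is the usual motivation for it when the patient record is represented by a very large number of features.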

RESULTS

Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90 and 0.89, and 0.86 and 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96 and 0.92, and 0.84 and 0.87, respectively. We have demonstrated the feasibility of learning phenotype models using imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.
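For readers less familiar with the two reported metrics, the following sketch shows how precision and accuracy are computed against a gold standard. The gold and predicted labels are invented for illustration and do not come from the study.

```python
def confusion_counts(gold, predicted):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    return tp, fp, tn, fn


def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, fp, tn, fn):
    """Fraction of all cases that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical gold-standard and model labels for eight patients.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(gold, predicted)
model_precision = precision(tp, fp)
model_accuracy = accuracy(tp, fp, tn, fn)
```

Precision ignores missed cases (false negatives), so a model can have high precision and still overlook many patients with the phenotype; accuracy counts both kinds of error.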

CONCLUSIONS

Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.

Similar articles

Weakly Semi-supervised phenotyping using Electronic Health records.

J Biomed Inform. 2022 Oct;134:104175. doi: 10.1016/j.jbi.2022.104175. Epub 2022 Sep 5.

Automated feature selection of predictors in electronic medical records data.

Biometrics. 2019 Mar;75(1):268-277. doi: 10.1111/biom.12987. Epub 2019 Apr 2.

Surrogate-assisted feature extraction for high-throughput phenotyping.

J Am Med Inform Assoc. 2017 Apr 1;24(e1):e143-e149. doi: 10.1093/jamia/ocw135.

Development and validation of phenotype classifiers across multiple sites in the observational health data sciences and informatics network.

J Am Med Inform Assoc. 2020 Jun 1;27(6):877-883. doi: 10.1093/jamia/ocaa032.

A machine learning-based framework to identify type 2 diabetes through electronic health records.

Int J Med Inform. 2017 Jan;97:120-127. doi: 10.1016/j.ijmedinf.2016.09.014. Epub 2016 Oct 1.

A clinical text classification paradigm using weak supervision and deep representation.

BMC Med Inform Decis Mak. 2019 Jan 7;19(1):1. doi: 10.1186/s12911-018-0723-6.

Relational machine learning for electronic health record-driven phenotyping.

J Biomed Inform. 2014 Dec;52:260-70. doi: 10.1016/j.jbi.2014.07.007. Epub 2014 Jul 15.

Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions.

Neural Comput Appl. 2021 Oct 29:1-9. doi: 10.1007/s00521-021-06614-2.

Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: A cross-sectional, unselected, retrospective study.

J Biomed Inform. 2016 Apr;60:162-8. doi: 10.1016/j.jbi.2015.12.006. Epub 2015 Dec 17.

Cited by

Predictive Models Using Machine Learning to Identify Fetal Growth Restriction in Patients With Preeclampsia: Development and Evaluation Study.

J Med Internet Res. 2025 May 27;27:e70068. doi: 10.2196/70068.

Leveraging undecided cases in chart-reviewed phenotypes to enhance EHR-based association studies.

J Biomed Inform. 2025 Jun;166:104839. doi: 10.1016/j.jbi.2025.104839. Epub 2025 Apr 30.

Clinical Research Informatics: a Decade-in-Review.

Yearb Med Inform. 2024 Aug;33(1):127-142. doi: 10.1055/s-0044-1800732. Epub 2025 Apr 8.

ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.

J Biomed Inform. 2025 Feb;162:104761. doi: 10.1016/j.jbi.2024.104761. Epub 2025 Jan 23.

Towards automated phenotype definition extraction using large language models.

Genomics Inform. 2024 Oct 31;22(1):21. doi: 10.1186/s44342-024-00023-2.

Use of noisy labels as weak learners to identify incompletely ascertainable outcomes: A Feasibility study with opioid-induced respiratory depression.

Heliyon. 2024 Feb 16;10(5):e26434. doi: 10.1016/j.heliyon.2024.e26434. eCollection 2024 Mar 15.

Machine learning to identify chronic cough from administrative claims data.

Sci Rep. 2024 Jan 30;14(1):2449. doi: 10.1038/s41598-024-51522-9.

Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms.

J Am Med Inform Assoc. 2024 Feb 16;31(3):640-650. doi: 10.1093/jamia/ocad226.

Data-driven automated classification algorithms for acute health conditions: applying PheNorm to COVID-19 disease.

J Am Med Inform Assoc. 2024 Feb 16;31(3):574-582. doi: 10.1093/jamia/ocad241.

Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review.

JMIR Med Inform. 2023 Dec 15;11:e42477. doi: 10.2196/42477.
