Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources.

Authors

Yu Sheng, Liao Katherine P, Shaw Stanley Y, Gainer Vivian S, Churchill Susanne E, Szolovits Peter, Murphy Shawn N, Kohane Isaac S, Cai Tianxi

Affiliations

Partners HealthCare Personalized Medicine, Boston, MA, USA; Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.

Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.

Publication information

J Am Med Inform Assoc. 2015 Sep;22(5):993-1000. doi: 10.1093/jamia/ocv034. Epub 2015 Apr 29.

DOI:10.1093/jamia/ocv034

PMID:25929596

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4986664/

Abstract

OBJECTIVE

Analysis of narrative (text) data from electronic health records (EHRs) can improve population-scale phenotyping for clinical and genetic research. Currently, selection of text features for phenotyping algorithms is slow and laborious, requiring extensive and iterative involvement by domain experts. This paper introduces a method to develop phenotyping algorithms in an unbiased manner by automatically extracting and selecting informative features, which can be comparable to expert-curated ones in classification accuracy.

MATERIALS AND METHODS

Comprehensive medical concepts were collected from publicly available knowledge sources in an automated, unbiased fashion. Natural language processing (NLP) revealed the occurrence patterns of these concepts in EHR narrative notes, which enabled selection of informative features for phenotype classification. The selected NLP features were then combined with additional codified features to train a penalized logistic regression model that classifies the target phenotype.
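The model-fitting step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which penalty or solver was used, so the L1 (lasso) penalty, the proximal gradient solver, the log(1 + count) feature transform, the synthetic data, and all function names below are assumptions made for illustration only.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(value, threshold):
    """Proximal step for an L1 penalty: shrink toward zero, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def center_columns(rows):
    """Subtract each column's mean so the unpenalized intercept absorbs offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def fit_l1_logistic(rows, labels, penalty=0.05, step=0.1, iterations=500):
    """Fit L1-penalized logistic regression by proximal gradient descent."""
    n, p = len(rows), len(rows[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(rows, labels):
            linear = intercept + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(linear) - label  # gradient of the log-loss
            grad_b += error / n
            for j in range(p):
                grad_w[j] += error * row[j] / n
        intercept -= step * grad_b  # the intercept is not penalized
        weights = [soft_threshold(w - step * g, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return weights, intercept

# Synthetic, hypothetical data: 100 cases followed by 100 controls.
random.seed(0)
labels = [1] * 100 + [0] * 100
rows = []
for label in labels:
    low, high = (2, 10) if label == 1 else (0, 3)
    rows.append([
        math.log1p(random.randint(low, high)),  # hypothetical NLP concept count
        math.log1p(random.randint(low, high)),  # hypothetical codified count
        math.log1p(random.randint(0, 5)),       # uninformative concept count
        math.log1p(random.randint(0, 5)),       # uninformative concept count
    ])

weights, intercept = fit_l1_logistic(center_columns(rows), labels)
# Features whose weight was not shrunk to exactly zero are the "selected" ones.
selected = [j for j, w in enumerate(weights) if w != 0.0]
```

The soft-thresholding step is what makes an L1 penalty perform feature selection: weights whose gradient signal stays below the penalty are set to exactly zero, so only informative features remain in the fitted model.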

RESULTS

The authors applied this method to develop algorithms that identify patients with rheumatoid arthritis (RA), and coronary artery disease (CAD) cases among patients with RA, in a large multi-institutional EHR. The areas under the receiver operating characteristic curve (AUCs) for classifying RA and CAD using models trained with automated features were 0.951 and 0.929, respectively, compared with AUCs of 0.938 and 0.929 for models trained with expert-curated features.
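For readers unfamiliar with the metric, the AUC reported above can be read as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. The sketch below computes it that way on a toy example; the scores and labels are made up for illustration and are unrelated to the study data, and the function name is hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))

# Toy example: three cases (label 1) and three controls (label 0).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(scores, labels)  # 8 of the 9 case-control pairs are ranked correctly
```

This pairwise form is quadratic in the number of patients; it is shown because it makes the interpretation explicit, not because it is the most efficient way to compute the statistic.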

DISCUSSION

Models trained with NLP text features selected through an unbiased, automated procedure achieved comparable or slightly higher accuracy than those trained with expert-curated features. The majority of the selected model features were interpretable.

CONCLUSION

The proposed automated feature extraction method, generating highly accurate phenotyping algorithms with improved efficiency, is a significant step toward high-throughput phenotyping.
