Cohort design and natural language processing to reduce bias in electronic health records research.

Authors

Khurshid Shaan, Reeder Christopher, Harrington Lia X, Singh Pulkit, Sarma Gopal, Friedman Samuel F, Di Achille Paolo, Diamant Nathaniel, Cunningham Jonathan W, Turner Ashby C, Lau Emily S, Haimovich Julian S, Al-Alusi Mostafa A, Wang Xin, Klarqvist Marcus D R, Ashburner Jeffrey M, Diedrich Christian, Ghadessi Mercedeh, Mielke Johanna, Eilken Hanna M, McElhinney Alice, Derix Andrea, Atlas Steven J, Ellinor Patrick T, Philippakis Anthony A, Anderson Christopher D, Ho Jennifer E, Batra Puneet, Lubitz Steven A

Affiliations

Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA.

Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA.

Publication information

NPJ Digit Med. 2022 Apr 8;5(1):47. doi: 10.1038/s41746-022-00590-0.

DOI: 10.1038/s41746-022-00590-0
PMID: 35396454
Full text:
https://pmc.ncbi.nlm.nih.gov/articles/PMC8993873/
Abstract

Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95-0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO as opposed to the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012-0.030 in C3PO vs. 0.028-0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
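
The abstract does not describe how the NLP step works, so the following is only a minimal sketch of the general idea: pull a vital sign out of free-text notes, use it to fill gaps in the structured field, then measure the drop in missingness and the agreement with structured values. The regular expression, the plausibility bounds, the record layout (`structured_sbp`, `note`), and the toy data are all assumptions made for illustration and are not taken from the paper. The sketch computes a relative reduction in missingness, which is one possible reading of the reported 31%.

```python
import math
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: matches "BP 128/76", "BP=128/76", "blood pressure: 128 / 76".
BP_PATTERN = re.compile(
    r"\b(?:bp|blood pressure)\s*[:=]?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_bp(note: str) -> Optional[Tuple[int, int]]:
    """Return (systolic, diastolic) found in free text, or None."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    systolic, diastolic = int(match.group(1)), int(match.group(2))
    # Illustrative plausibility bounds (an assumption, not from the paper).
    if not (50 <= systolic <= 300 and 20 <= diastolic <= 200 and systolic > diastolic):
        return None
    return systolic, diastolic


def systolic_from_note(note: str) -> Optional[int]:
    """Return only the systolic value recovered from a note, or None."""
    bp = extract_bp(note)
    return bp[0] if bp is not None else None


def missing_fraction(values: List[Optional[int]]) -> float:
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy records: structured systolic BP (None means missing) plus a clinical note.
records = [
    {"structured_sbp": 120, "note": "Pt seen for follow-up. BP 122/80, HR 72."},
    {"structured_sbp": None, "note": "Vitals: blood pressure: 135/85. No complaints."},
    {"structured_sbp": None, "note": "Telephone encounter, no vitals recorded."},
    {"structured_sbp": 150, "note": "BP 148/92 today."},
    {"structured_sbp": None, "note": "BP=110/70"},
    {"structured_sbp": 101, "note": "bp 100 / 64"},
]

structured = [rec["structured_sbp"] for rec in records]
from_notes = [systolic_from_note(rec["note"]) for rec in records]

# Prefer the structured value; fall back to the note-derived value.
combined = [s if s is not None else n for s, n in zip(structured, from_notes)]

before = missing_fraction(structured)
after = missing_fraction(combined)
relative_reduction = (before - after) / before

# Agreement is only measurable where both sources have a value.
pairs = [(s, n) for s, n in zip(structured, from_notes) if s is not None and n is not None]
r = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

print(f"missing before={before:.2f}, after={after:.2f}, "
      f"relative reduction={relative_reduction:.2f}, r={r:.3f}")
```

Checking agreement only on records where both sources exist mirrors the comparison the abstract reports between NLP-recovered and structured values. A production pipeline would likely need far more robust extraction than a single regular expression.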

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/03d659422c53/41746_2022_590_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/0a1cfde38b42/41746_2022_590_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/5b1117b9e027/41746_2022_590_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/1e2c32a7a81a/41746_2022_590_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/0b40fb90ffeb/41746_2022_590_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/adf2f2f88988/41746_2022_590_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/aa3279f23f6f/41746_2022_590_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b033/8993873/e93e7c084ce3/41746_2022_590_Fig8_HTML.jpg
