An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge.

Affiliations

Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Kasteelpark Arenberg 10 - Box 2446, 3001, Leuven, Belgium.

Leuven Statistics Research Center, KU Leuven, 3000, Leuven, Belgium.

Publication information

BMC Med Inform Decis Mak. 2021 Sep 17;21(1):267. doi: 10.1186/s12911-021-01630-7.

DOI:10.1186/s12911-021-01630-7
PMID:34535146
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8449435/
Abstract

BACKGROUND

The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating the cleaning step can save time. However, automated data cleaning tools built for other domains often process all variables uniformly, so they serve clinical data poorly, where variable-specific information must be taken into account. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into consideration.

METHODS

We used EHR data collected from primary care in Flanders, Belgium, during 1994-2015. We constructed a Clinical Knowledge Database to store all the variable-specific information needed for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion following the variable-specific conversion formula. The numeric values were then corrected and outliers were detected using the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of non-missing values (completeness) and the percentage of values within the normal range (correctness) were compared before and after the cleaning process.
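The unit-matching, conversion, and outlier steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `KNOWLEDGE` dictionary, its field names, the approximate glucose conversion factor, the plausible range, and the similarity cutoff are all assumptions made for the example, and Python's `difflib` stands in for whatever fuzzy search the paper used.

```python
import difflib

# Hypothetical knowledge-base entry for a single variable.
# All numbers here are illustrative assumptions, not values from the paper.
KNOWLEDGE = {
    "glucose": {
        "standard_unit": "mg/dL",
        # Multiplier that converts each known unit to the standard unit
        # (18.0 is an approximate factor for glucose).
        "conversions": {"mg/dL": 1.0, "mmol/L": 18.0},
        # Values outside this range are treated as outliers.
        "plausible_range": (10.0, 1000.0),
    }
}


def normalize_unit(raw_unit, known_units, cutoff=0.6):
    """Return the known unit closest to raw_unit, or None if nothing is close."""
    lookup = {unit.lower(): unit for unit in known_units}
    matches = difflib.get_close_matches(
        raw_unit.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


def clean_value(variable, value, raw_unit):
    """Fix the unit spelling, convert to the standard unit, and drop outliers.

    Returns the converted value, or None when the unit cannot be recognized
    or the converted value falls outside the plausible range.
    """
    entry = KNOWLEDGE[variable]
    unit = normalize_unit(raw_unit, entry["conversions"])
    if unit is None:
        return None
    converted = value * entry["conversions"][unit]
    low, high = entry["plausible_range"]
    return converted if low <= converted <= high else None
```

For example, `clean_value("glucose", 5.5, "mmoll")` would match the misspelled unit to `mmol/L` and convert the value to the standard unit, while an unrecognizable unit or an implausible value yields `None`, which then counts as missing.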

RESULTS

All variables were 100% complete before data cleaning. After cleaning, completeness dropped by less than 1% for 42 variables and by 1-10% for 9 variables. Only 1 variable showed a large decline in completeness (13.36%). After cleaning, every variable had more than 50% of its values within the normal range, and 43 variables exceeded 70%.
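The two quality measures reported here can be computed along the following lines. This is a hedged sketch: the function names are hypothetical, and it assumes that correctness is measured over non-missing values only, which the abstract does not state explicitly.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def completeness(values):
    """Percentage of entries that are not missing."""
    if not values:
        return 0.0
    present = sum(1 for v in values if not _is_missing(v))
    return 100.0 * present / len(values)


def correctness(values, normal_range):
    """Percentage of non-missing entries that fall within normal_range.

    Assumption: missing entries are excluded from the denominator.
    """
    low, high = normal_range
    present = [v for v in values if not _is_missing(v)]
    if not present:
        return 0.0
    within = sum(1 for v in present if low <= v <= high)
    return 100.0 * within / len(present)
```

Comparing `completeness` before and after cleaning shows how many values the cleaning step discarded, and `correctness` after cleaning shows how many of the remaining values are clinically normal.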

CONCLUSIONS

We propose a general method for clinical variables that achieves a high degree of automation and can handle large-scale data. The method substantially improves the efficiency of data cleaning and removes technical barriers for non-technical users.
