Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data.

Affiliations

The Roslin Institute, The University of Edinburgh, Easter Bush Campus, Midlothian, Edinburgh, United Kingdom.

The Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush Campus, Midlothian, Edinburgh, United Kingdom.

Publication information

PLoS One. 2020 Jan 24;15(1):e0228154. doi: 10.1371/journal.pone.0228154. eCollection 2020.

DOI:10.1371/journal.pone.0228154

PMID:31978151

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6980495/

Abstract

All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. 
This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
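
The evaluation described in the abstract (simulate errors in clean data, flag implausible values against a cut-off, correct or delete them, then report sensitivity and specificity) can be illustrated with a minimal sketch. This is not the authors' algorithm: the expected values stand in for the predictions of a fitted growth model, and the cut-off (0.5), the correction tolerance (0.1), the three correction rules and all function names are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical correction rules: undo a shifted decimal point or a lb-for-kg entry.
CORRECTIONS = {
    "divide_by_10": lambda v: v / 10,
    "multiply_by_10": lambda v: v * 10,
    "lb_to_kg": lambda v: v / 2.2,
}


def rel_dev(value: float, expected: float) -> float:
    """Relative deviation of a measurement from its expected value."""
    return abs(value - expected) / expected


def flag(observed: List[float], expected: List[float], cutoff: float) -> List[bool]:
    """Flag measurements whose relative deviation exceeds the cut-off."""
    return [rel_dev(o, e) > cutoff for o, e in zip(observed, expected)]


def correct_or_delete(value: float, expected: float, tol: float) -> Optional[float]:
    """Return the first corrected value within `tol` of expected, else None (delete)."""
    for rule in CORRECTIONS.values():
        candidate = rule(value)
        if rel_dev(candidate, expected) <= tol:
            return candidate
    return None


@dataclass
class Confusion:
    tp: int  # simulated errors that were flagged
    fp: int  # genuine values that were flagged
    tn: int  # genuine values left alone
    fn: int  # simulated errors that were missed

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


def score(flags: List[bool], is_error: List[bool]) -> Confusion:
    """Compare flags with the known positions of simulated errors."""
    pairs = list(zip(flags, is_error))
    return Confusion(
        tp=sum(f and e for f, e in pairs),
        fp=sum(f and not e for f, e in pairs),
        tn=sum(not f and not e for f, e in pairs),
        fn=sum(not f and e for f, e in pairs),
    )


# Toy weights in kg (made-up numbers, not data from the paper).
original = [3.5, 5.0, 7.2, 9.1, 10.4, 12.4]     # clean values before error simulation
observed = [3.5, 5.0, 72.0, 9.1, 22.88, 14.2]   # x10, kg entered as lb, digits swapped
expected = [2.0, 5.2, 7.0, 9.0, 10.5, 12.5]     # stand-in for model predictions

is_error = [o != g for o, g in zip(observed, original)]
flags = flag(observed, expected, cutoff=0.5)
c = score(flags, is_error)

# Flagged values are corrected where a rule fits, otherwise deleted (None).
cleaned = [
    correct_or_delete(o, e, tol=0.1) if f else o
    for o, e, f in zip(observed, expected, flags)
]

# Gold standard from the abstract: a corrected value equals its pre-error original.
restored = sum(
    1
    for v, g, e in zip(cleaned, original, is_error)
    if e and v is not None and abs(v - g) < 1e-9
)

print(f"sensitivity={c.sensitivity:.2f} specificity={c.specificity:.2f}")
print(f"cleaned={cleaned} restored={restored}")
```

With these toy numbers, two of the three simulated errors are flagged and restored to their original values, the digit-swap error is missed, and one genuine value is wrongly flagged and deleted, so sensitivity and specificity are both 2/3. The example also shows why the choice of expected values matters: a poor prediction for the first measurement is what causes the false positive.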

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d567/6980495/d73ca08f4075/pone.0228154.g001.jpg

Similar articles

Identifying biologically implausible values in big longitudinal data: an example applied to child growth data from the Brazilian food and nutrition surveillance system.

BMC Med Res Methodol. 2024 Feb 15;24(1):38. doi: 10.1186/s12874-024-02161-1.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Cleaning of anthropometric data from PCORnet electronic health records using automated algorithms.

JAMIA Open. 2022 Nov 2;5(4):ooac089. doi: 10.1093/jamiaopen/ooac089. eCollection 2022 Dec.

Wind power data cleaning using RANSAC-based polynomial and linear regression with adaptive threshold.

Sci Rep. 2025 Feb 11;15(1):5105. doi: 10.1038/s41598-025-89177-9.

A better performing algorithm for identification of implausible growth data from longitudinal pediatric medical records.

Sci Rep. 2024 Aug 6;14(1):18276. doi: 10.1038/s41598-024-69161-5.

Automated identification of implausible values in growth data from pediatric electronic health records.

J Am Med Inform Assoc. 2017 Nov 1;24(6):1080-1087. doi: 10.1093/jamia/ocx037.

Identifying erroneous height and weight values from adult electronic health records in the All of Us research program.

J Biomed Inform. 2024 Jul;155:104660. doi: 10.1016/j.jbi.2024.104660. Epub 2024 May 23.

Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort.

Sci Rep. 2020 Jun 23;10(1):10164. doi: 10.1038/s41598-020-66925-7.

New approaches and technical considerations in detecting outlier measurements and trajectories in longitudinal children growth data.

BMC Med Res Methodol. 2023 Oct 13;23(1):232. doi: 10.1186/s12874-023-02045-w.

Cited by

Protocol for improving equity in quantitative big data cleaning: lessons from longitudinal analysis of electronic health records from underrepresented and marginalized communities.

Int J Epidemiol. 2025 Feb 16;54(2). doi: 10.1093/ije/dyaf013.

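Identifying biologically implausible values in big longitudinal data: an example applied to child growth data from the Brazilian food and nutrition surveillance system.

BMC Med Res Methodol. 2024 Feb 15;24(1):38. doi: 10.1186/s12874-024-02161-1.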

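New approaches and technical considerations in detecting outlier measurements and trajectories in longitudinal children growth data.

BMC Med Res Methodol. 2023 Oct 13;23(1):232. doi: 10.1186/s12874-023-02045-w.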

Association between infant breastfeeding practices and timing of peak height velocity: A nationwide longitudinal survey in Japan.

Pediatr Res. 2023 Nov;94(5):1845-1854. doi: 10.1038/s41390-023-02706-y. Epub 2023 Jul 3.

Artificial Intelligence-Based Medical Data Mining.

J Pers Med. 2022 Aug 24;12(9):1359. doi: 10.3390/jpm12091359.

The impact of the COVID-19 pandemic on a cohort of Labrador retrievers in England.

BMC Vet Res. 2022 Jun 24;18(1):246. doi: 10.1186/s12917-022-03319-z.

Characterizing Undernourished Children Under-Five Years Old with Diarrhoea in Mozambique: A Hospital Based Cross-Sectional Study, 2015-2019.

Nutrients. 2022 Mar 10;14(6):1164. doi: 10.3390/nu14061164.

Veterinary Big Data: When Data Goes to the Dogs.

Animals (Basel). 2021 Jun 23;11(7):1872. doi: 10.3390/ani11071872.

Associations between Food Group Intake and Physical Frailty in Irish Community-Dwelling Older Adults.

Nutr Metab Insights. 2021 Mar 30;14:11786388211006447. doi: 10.1177/11786388211006447. eCollection 2021.
