

Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint.

Author information

Guo Manping, Wang Yiming, Yang Qiaoning, Li Rui, Zhao Yang, Li Chenfei, Zhu Mingbo, Cui Yao, Jiang Xin, Sheng Song, Li Qingna, Gao Rui

Affiliations

Postdoctoral Research Station, China Academy of Chinese Medical Sciences, Beijing, China.

Postdoctoral Works Station, Yabao Pharmaceutical Group Co, Ltd, Yuncheng, China.

Publication information

Interact J Med Res. 2023 Sep 21;12:e44310. doi: 10.2196/44310.

Abstract

With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields in the past 20 years. In the process of medical research, data are constantly generated, and large amounts of real-world data form a "data disaster." Effective data analysis and mining are based on data availability and high data quality. The premise of high data quality is the need to clean the data. Data cleaning is the process of detecting and correcting "dirty data," which is the basis of data analysis and management. Moreover, data cleaning is a common technology for improving data quality. However, the current literature on real-world research provides little guidance on how to efficiently and ethically set up and perform data cleaning. To address this issue, we proposed a data cleaning framework for real-world research, focusing on the 3 most common types of dirty data (duplicate, missing, and outlier data), and a normal workflow for data cleaning to serve as a reference for the application of such technologies in future studies. We also provided relevant suggestions for common problems in data cleaning.
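As a rough illustration of the three dirty-data types named in the abstract, the following minimal sketch flags duplicate, missing, and outlier records in a small list of dictionaries. It is not the authors' framework: the function name `profile_dirty_data`, the field names, the sample values, and the choice of the 1.5 x IQR rule for outliers are all assumptions made for this example. The sketch only flags records and leaves any correction or removal to a later, documented step.

```python
from statistics import quantiles


def profile_dirty_data(records, key_fields, numeric_field):
    """Flag duplicate, missing, and outlier records without modifying them.

    Hypothetical helper for illustration only; the outlier rule
    (1.5 x IQR fences) is an assumption, not taken from the article.
    """
    seen = set()
    duplicates, missing, values = [], [], []
    for idx, rec in enumerate(records):
        # A record is a duplicate if its key fields repeat an earlier record.
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            duplicates.append(idx)
            continue  # duplicates are excluded from the later checks
        seen.add(key)
        value = rec.get(numeric_field)
        if value is None:
            # Only the numeric field is checked for missingness here.
            missing.append(idx)
        else:
            values.append((idx, value))

    outliers = []
    if len(values) >= 4:
        # Quartiles of the observed values, then the 1.5 x IQR fences.
        q1, _, q3 = quantiles([v for _, v in values], n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = [idx for idx, v in values if v < low or v > high]

    return {"duplicates": duplicates, "missing": missing, "outliers": outliers}


# Made-up sample records (systolic blood pressure, mm Hg).
records = [
    {"patient_id": 1, "visit": 1, "sbp": 120},
    {"patient_id": 1, "visit": 1, "sbp": 120},   # repeated key
    {"patient_id": 2, "visit": 1, "sbp": None},  # missing measurement
    {"patient_id": 3, "visit": 1, "sbp": 118},
    {"patient_id": 4, "visit": 1, "sbp": 125},
    {"patient_id": 5, "visit": 1, "sbp": 122},
    {"patient_id": 6, "visit": 1, "sbp": 300},   # implausibly high value
]

report = profile_dirty_data(records, ["patient_id", "visit"], "sbp")
print(report)
```

Returning record indexes rather than a cleaned data set keeps the original data intact, so each flagged record can be reviewed before anything is changed.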


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49d6/10557005/3bf3df3eb058/ijmr_v12i1e44310_fig1.jpg
