Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
Comput Methods Programs Biomed. 2018 Aug;162:69-85. doi: 10.1016/j.cmpb.2018.05.007. Epub 2018 May 5.
Data mining (DM) has, over the last decade, received increased attention in the medical domain and has been widely used to analyze medical datasets in order to extract useful knowledge and previously unknown patterns. However, historical medical data are often inconsistent, noisy, imbalanced, incomplete and high-dimensional. These problems can seriously bias predictive models and reduce the performance of DM techniques. Data preprocessing is, therefore, an essential step in knowledge discovery: it improves the quality of the data and makes them suitable for DM techniques. The objective of this paper is to review the use of preprocessing techniques in clinical datasets.
We performed a systematic mapping of studies on the application of data preprocessing to healthcare that were published between January 2000 and December 2017. A search string was defined on the basis of the mapping questions and the PICO (Population, Intervention, Comparison, Outcome) categories. The search string was then applied to digital databases covering the fields of computer science and medical informatics in order to identify relevant studies. The studies were initially selected by reading their titles, abstracts and keywords. Those selected at that stage were then reviewed against a set of inclusion and exclusion criteria in order to eliminate any that were not relevant. This process resulted in 126 primary studies.
The selected studies were analyzed and classified according to their publication years and channels, research type, empirical type and contribution type. The findings of this mapping study revealed that researchers have paid a considerable amount of attention to preprocessing in medical DM in the last decade. A significant number of the selected studies used data reduction and data cleaning preprocessing tasks. Moreover, the disciplines in which preprocessing has received the most attention are cardiology, endocrinology and oncology.
Researchers should develop and implement standards for an effective integration of multiple medical data types. Moreover, we identified the need to perform literature reviews.