Cheng Ching-Hsue, Chang Jing-Rong, Huang Hao-Hsuan
Department of Information Management, National Yunlin University of Science & Technology, 123, section 3, University Road, Touliu, Yunlin 640, Taiwan.
Department of Information Management, Chaoyang University of Technology, Taichung, Taiwan.
Comput Biol Med. 2020 Jul;122:103824. doi: 10.1016/j.compbiomed.2020.103824. Epub 2020 May 30.
Data in the medical field often contain missing values, which can bias research results. This work therefore proposes a new imputation method, a weighted distance threshold method, for imputing missing values. Across several experiments, we find that the proposed method has the following benefits: (1) using purity, it can reassign instances to the nearest class of the dataset, and the purity computation can filter outliers; (2) it redefines the degree of missing values and can determine the attributes and instances related to the missing values in different datasets; and (3) it does not require the k value of the nearest neighborhood to be set in advance, because k is identified from the best threshold used to calculate purity, which improves the imputation results. In addition, the distance threshold can adjust the nearest neighborhood used to estimate missing values. We run several experiments comparing the proposed method with other imputation methods across different missing types, missing degrees, and types of datasets. The results indicate that the proposed method outperforms the listed methods. Moreover, we use the stroke dataset from the International Stroke Trial (IST) to verify whether the proposed method can be applied effectively in practice; the proposed method achieves 90% accuracy on this dataset.
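The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of replacing a fixed k with a distance threshold when choosing neighbors for imputation. It is not the authors' method: the function name `threshold_impute`, the Euclidean distance on observed attributes, the inverse-distance weights, and the fallback to the closest donor are all assumptions made for illustration, and the purity-based threshold selection described above is omitted.

```python
import numpy as np

def threshold_impute(X, threshold):
    """Fill NaNs with an inverse-distance-weighted mean of complete rows
    that lie within `threshold` of the incomplete row (hypothetical sketch)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # Donor pool: rows with no missing values.
    complete = X[~np.isnan(X).any(axis=1)]
    if complete.shape[0] == 0:
        raise ValueError("need at least one complete row")
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        # Euclidean distance computed on the observed attributes only.
        diff = complete[:, observed] - row[observed]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        # Neighborhood size is set by the threshold, not by a fixed k.
        near = dist <= threshold
        if not near.any():
            # Assumption: fall back to the closest donor(s) if none qualify.
            near = dist == dist.min()
        weights = 1.0 / (dist[near] + 1e-9)
        donors = complete[near][:, missing]
        filled[i, missing] = (weights[:, None] * donors).sum(axis=0) / weights.sum()
    return filled
```

With a small threshold only very similar records contribute, and with a large one the estimate approaches a weighted mean over all complete records. In this sketch the threshold is a plain input, whereas the paper reports choosing it from the best purity.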