Government College University Faisalabad, Pakistan; Ludwig-Maximilians-Universität München, Germany.
Ludwig-Maximilians-Universität München, Germany.
Comput Biol Med. 2021 Aug;135:104577. doi: 10.1016/j.compbiomed.2021.104577. Epub 2021 Jun 17.
In modern biomedical research, data often contain a large number of variables of mixed types (continuous, multi-categorical, or binary), with observations missing on some variables. Imputation is a common solution when downstream analyses require a complete data matrix. Several available imputation methods rely on specific distributional assumptions. We propose an improvement over the popular non-parametric nearest neighbor imputation method, which requires no such assumptions. The proposed method makes practical and effective use of information on the association among the variables. In particular, we propose a weighted version of the L_q distance for mixed-type data that uses information from a subset of important variables only. The performance of the proposed method is investigated using a variety of simulated and real data from different areas of application. The results show that the proposed method yields smaller imputation error and better performance than the other approaches compared. It is also shown that the proposed imputation method works efficiently even when the number of samples is smaller than the number of variables.
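To make the idea of a weighted distance for mixed-type data concrete, the following is a minimal sketch of nearest neighbor imputation, not the authors' implementation. Several details are assumptions made for illustration only: the function name `weighted_knn_impute`, the weight matrix `W` (taken as given here; the abstract does not specify how the weights are derived from the associations among variables), the min-max scaling of continuous columns, the 0/1 mismatch used for categorical columns, and the use of the donor mean or mode as the imputed value.

```python
import numpy as np

def weighted_knn_impute(X, is_cat, W, k=3, q=2.0):
    """Fill NaNs in X using donors ranked by a weighted L_q-type distance.

    X      : (n, p) float array; NaN marks a missing value, categorical
             columns hold integer codes stored as floats.
    is_cat : (p,) boolean array, True for categorical columns.
    W      : (p, p) non-negative array (assumed input); W[t, s] is the
             importance of column s when imputing column t. A zero weight
             drops column s, so only a subset of variables is used.
    """
    X = np.asarray(X, dtype=float)
    is_cat = np.asarray(is_cat, dtype=bool)
    W = np.asarray(W, dtype=float)
    out = X.copy()

    # Scale continuous columns to [0, 1] so they are comparable
    # with the 0/1 mismatch used for categorical columns.
    lo = np.nanmin(X, axis=0)
    span = np.nanmax(X, axis=0) - lo
    span[span == 0] = 1.0
    Z = np.where(is_cat, X, (X - lo) / span)

    for i, t in zip(*np.nonzero(np.isnan(X))):
        # Donors are rows with an observed value in column t.
        donors = np.flatnonzero(~np.isnan(X[:, t]))
        w = W[t].copy()
        w[t] = 0.0  # the target column never contributes to the distance
        dist = np.full(donors.size, np.inf)
        for a, j in enumerate(donors):
            # Use only weighted columns observed in both rows.
            both = ~np.isnan(Z[i]) & ~np.isnan(Z[j]) & (w > 0)
            if not both.any():
                continue
            diff = np.abs(Z[i, both] - Z[j, both])
            diff = np.where(is_cat[both], (diff > 0).astype(float), diff)
            dist[a] = (np.sum(w[both] * diff ** q) / np.sum(w[both])) ** (1.0 / q)

        order = np.argsort(dist, kind="stable")
        order = order[np.isfinite(dist[order])][:k]
        if order.size == 0:
            continue  # no comparable donor; leave the value missing
        vals = X[donors[order], t]
        if is_cat[t]:
            codes, counts = np.unique(vals, return_counts=True)
            out[i, t] = codes[np.argmax(counts)]  # most frequent category
        else:
            out[i, t] = vals.mean()  # average of the k nearest donors
    return out
```

Calling `weighted_knn_impute(X, is_cat, W, k=5)` returns a copy of `X` with missing entries filled wherever at least one comparable donor exists. Distances are always computed on the original observed values, so earlier imputations do not influence later ones.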