Imputation methods for high-dimensional mixed-type datasets by nearest neighbors.

Author affiliations

Government College University Faisalabad, Pakistan; Ludwig-Maximilians-Universität München, Germany.

Ludwig-Maximilians-Universität München, Germany.

Publication information

Comput Biol Med. 2021 Aug;135:104577. doi: 10.1016/j.compbiomed.2021.104577. Epub 2021 Jun 17.

DOI:10.1016/j.compbiomed.2021.104577

PMID:34216892

Abstract

In modern biomedical research, data often contain a large number of variables of mixed types (continuous, multi-categorical, or binary), with observations missing for some variables. Imputation is a common solution when the downstream analyses require a complete data matrix. Several imputation methods are available that work under specific distributional assumptions. We propose an improvement over the popular non-parametric nearest neighbor imputation method, which requires no particular assumptions. The proposed method makes practical and effective use of the information on the association among the variables. In particular, we propose a weighted version of the L_q distance for mixed-type data, which uses the information from a subset of important variables only. The performance of the proposed method is investigated using a variety of simulated and real data from different areas of application. The results show that the proposed method yields smaller imputation error and better performance than other approaches. It is also shown that the proposed imputation method works efficiently even when the number of samples is smaller than the number of variables.
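To make the idea in the abstract concrete, the following is a minimal sketch of nearest neighbor imputation with a weighted L_q-type distance over mixed-type variables. It is not the authors' implementation: the abstract does not say how the weights are derived or how categorical variables enter the distance, so the function names (`weighted_lq_distance`, `knn_impute_cell`), the 0/1 mismatch for categorical variables, and the user-supplied `weights` are all assumptions made for illustration. Setting a weight to zero drops that variable, which mimics using only a subset of important variables.

```python
import numpy as np


def weighted_lq_distance(x, y, is_cat, weights, q=2.0):
    """Weighted L_q-type distance over coordinates observed in both rows.

    Assumption: continuous coordinates contribute |x - y| and categorical
    coordinates (stored as numeric codes) contribute a 0/1 mismatch.
    np.nan marks a missing value.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    is_cat = np.asarray(is_cat, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    usable = ~np.isnan(x) & ~np.isnan(y) & (weights > 0)
    if not usable.any():
        return np.inf  # no shared informative coordinate
    diff = np.where(is_cat, (x != y).astype(float), np.abs(x - y))
    w = weights[usable]
    return (np.sum(w * diff[usable] ** q) / np.sum(w)) ** (1.0 / q)


def knn_impute_cell(data, i, j, is_cat, weights, k=3, q=2.0):
    """Impute data[i, j] from the k nearest rows that have column j observed."""
    data = np.asarray(data, dtype=float)
    w = np.asarray(weights, dtype=float).copy()
    w[j] = 0.0  # the target column cannot inform its own distance
    donors = np.array(
        [r for r in range(data.shape[0]) if r != i and not np.isnan(data[r, j])]
    )
    if donors.size == 0:
        raise ValueError("no row has column j observed")
    dists = np.array(
        [weighted_lq_distance(data[i], data[r], is_cat, w, q) for r in donors]
    )
    nearest = donors[np.argsort(dists)[:k]]
    values = data[nearest, j]
    if is_cat[j]:
        # Categorical target: majority vote among the neighbors.
        cats, counts = np.unique(values, return_counts=True)
        return float(cats[np.argmax(counts)])
    # Continuous target: unweighted mean of the neighbors' values.
    return float(values.mean())
```

The sketch assumes continuous columns have already been scaled to comparable ranges, since otherwise a single wide-range variable would dominate the distance. A call such as `knn_impute_cell(data, i, j, is_cat, weights, k=2)` fills one cell at a time; looping over all missing cells gives a complete matrix.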

Similar articles

Imputation methods for high-dimensional mixed-type datasets by nearest neighbors.

Comput Biol Med. 2021 Aug;135:104577. doi: 10.1016/j.compbiomed.2021.104577. Epub 2021 Jun 17.

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

BMC Bioinformatics. 2014 Nov 5;15(1):346. doi: 10.1186/s12859-014-0346-6.

Missing value imputation for gene expression data by tailored nearest neighbors.

Stat Appl Genet Mol Biol. 2017 Apr 25;16(2):95-106. doi: 10.1515/sagmb-2015-0098.

Advanced methods for missing values imputation based on similarity learning.

PeerJ Comput Sci. 2021 Jul 21;7:e619. doi: 10.7717/peerj-cs.619. eCollection 2021.

Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies.

BMC Bioinformatics. 2017 Feb 20;18(1):114. doi: 10.1186/s12859-017-1547-6.

A real data-driven simulation strategy to select an imputation method for mixed-type trait data.

PLoS Comput Biol. 2023 Mar 22;19(3):e1010154. doi: 10.1371/journal.pcbi.1010154. eCollection 2023 Mar.

A novel weighted distance threshold method for handling medical missing values.

Comput Biol Med. 2020 Jul;122:103824. doi: 10.1016/j.compbiomed.2020.103824. Epub 2020 May 30.

MissForest--non-parametric missing value imputation for mixed-type data.

Bioinformatics. 2012 Jan 1;28(1):112-8. doi: 10.1093/bioinformatics/btr597. Epub 2011 Oct 28.

Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets.

J Healthc Eng. 2018 Feb 4;2018:1817479. doi: 10.1155/2018/1817479. eCollection 2018.

EvoImp: Multiple Imputation of Multi-label Classification data with a genetic algorithm.

PLoS One. 2024 Jan 19;19(1):e0297147. doi: 10.1371/journal.pone.0297147. eCollection 2024.

Cited by

Topological Structures in the Space of Treatment-Naïve Patients with Chronic Lymphocytic Leukemia.

Cancers (Basel). 2024 Jul 26;16(15):2662. doi: 10.3390/cancers16152662.

Exploring the utility of a latent variable as comprehensive inflammatory prognostic index in critically ill patients with cerebral infarction.

Front Neurol. 2024 Jan 15;15:1287895. doi: 10.3389/fneur.2024.1287895. eCollection 2024.

Discrete Missing Data Imputation Using Multilayer Perceptron and Momentum Gradient Descent.

Sensors (Basel). 2022 Jul 28;22(15):5645. doi: 10.3390/s22155645.
