Imputing missing genotypes with weighted k nearest neighbors.

Affiliation

Statistics, TU Dortmund University, Dortmund, Germany.

Publication information

J Toxicol Environ Health A. 2012;75(8-10):438-46. doi: 10.1080/15287394.2012.674910.

Abstract

Missing values are a common problem in genetic association studies concerned with single-nucleotide polymorphisms (SNPs). Since many statistical methods cannot handle missing values, such values need to be removed prior to the actual analysis. Considering only complete observations, however, often leads to an immense loss of information. Therefore, procedures are required that can be used to impute such missing values. In this study, an imputation procedure based on a weighted k nearest neighbors algorithm is presented. This approach, called KNNcatImpute, searches for the k SNPs that are most similar to the SNP whose missing values need to be replaced and uses these k SNPs to impute the missing values. Alternatively, KNNcatImpute can search for the k nearest subjects. In this situation, the missing values of an individual are imputed by considering subjects showing a DNA pattern similar to the one of this individual. In a comparison to other imputation approaches, KNNcatImpute shows the lowest rates of falsely imputed genotypes when applied to the SNP data from the GENICA study, a candidate SNP study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Moreover, KNNcatImpute can also be applied to data from genome-wide association studies, as an application to a subset of the HapMap data demonstrates.
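The SNP-wise search described in the abstract can be illustrated with a minimal sketch. This is not the published KNNcatImpute implementation: the abstract does not state the distance measure or the weighting scheme, so the sketch assumes genotypes coded 0/1/2, a simple mismatch-fraction distance between SNPs, and inverse-distance weighted voting. The function name `knn_impute_categorical` and the `missing=-1` sentinel are hypothetical.

```python
import numpy as np


def knn_impute_categorical(genotypes, k=3, missing=-1):
    """Impute missing genotype codes by weighted voting among the k nearest SNPs.

    genotypes: 2-D integer array, rows = subjects, columns = SNPs,
               entries in {0, 1, 2} or the `missing` sentinel (assumed coding).
    """
    data = np.asarray(genotypes)
    imputed = data.copy()
    n_snps = data.shape[1]
    observed = data != missing

    for target in range(n_snps):
        missing_rows = np.where(~observed[:, target])[0]
        if missing_rows.size == 0:
            continue

        # Assumed distance: fraction of mismatching genotypes among subjects
        # observed at both SNPs. SNPs with no shared subjects stay at infinity.
        distances = np.full(n_snps, np.inf)
        for other in range(n_snps):
            if other == target:
                continue
            shared = observed[:, target] & observed[:, other]
            if shared.any():
                distances[other] = np.mean(data[shared, target] != data[shared, other])

        for row in missing_rows:
            # Only SNPs that are observed for this subject can vote.
            candidates = np.where(observed[row] & np.isfinite(distances))[0]
            if candidates.size == 0:
                continue  # nothing to borrow from; leave the value missing
            order = np.argsort(distances[candidates], kind="stable")[:k]
            nearest = candidates[order]

            # Assumed weighting: inverse distance, so closer SNPs count more.
            weights = 1.0 / (distances[nearest] + 1e-6)
            votes = np.zeros(3)
            for snp, weight in zip(nearest, weights):
                votes[data[row, snp]] += weight
            imputed[row, target] = int(np.argmax(votes))

    return imputed
```

Under the same assumptions, the subject-wise variant mentioned in the abstract can be obtained by transposing the input and the result, for example `knn_impute_categorical(genotypes.T, k=3).T`, so that the neighbors are subjects rather than SNPs.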

Similar articles

Imputing missing genotypes with weighted k nearest neighbors.

J Toxicol Environ Health A. 2012;75(8-10):438-46. doi: 10.1080/15287394.2012.674910.

Identification of SNP interactions using logic regression.

Biostatistics. 2008 Jan;9(1):187-98. doi: 10.1093/biostatistics/kxm024. Epub 2007 Jun 19.

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data.

Bioinformatics. 2005 May 15;21(10):2417-23. doi: 10.1093/bioinformatics/bti345. Epub 2005 Feb 24.

Imputing missing genotypic data of single-nucleotide polymorphisms using neural networks.

Eur J Hum Genet. 2008 Apr;16(4):487-95. doi: 10.1038/sj.ejhg.5201988. Epub 2008 Jan 16.

Accuracy of genotype imputation in sheep breeds.

Anim Genet. 2012 Feb;43(1):72-80. doi: 10.1111/j.1365-2052.2011.02208.x. Epub 2011 May 27.

Detecting high-order interactions of single nucleotide polymorphisms using genetic programming.

Bioinformatics. 2007 Dec 15;23(24):3280-8. doi: 10.1093/bioinformatics/btm522. Epub 2007 Nov 15.

Evaluation of potential power gain with imputed genotypes in genome-wide association studies.

Hum Hered. 2009;68(1):23-34. doi: 10.1159/000210446. Epub 2009 Apr 1.

Imputation of missing single nucleotide polymorphism genotypes using a multivariate mixed model framework.

J Anim Sci. 2011 Jul;89(7):2042-9. doi: 10.2527/jas.2010-3297. Epub 2011 Feb 25.

Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows.

Bioinformatics. 2007 Jul 1;23(13):i401-7. doi: 10.1093/bioinformatics/btm220.

How to link call rate and p-values for Hardy-Weinberg equilibrium as measures of genome-wide SNP data quality.

Stat Med. 2010 Sep 30;29(22):2347-58. doi: 10.1002/sim.4004.

Cited by

K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data.

Adv Drug Alcohol Res. 2025 Jan 28;4:13449. doi: 10.3389/adar.2024.13449. eCollection 2024.

Multi-metric comparison of machine learning imputation methods with application to breast cancer survival.

BMC Med Res Methodol. 2024 Aug 30;24(1):191. doi: 10.1186/s12874-024-02305-3.

Genome-wide association study for in vitro digestibility and related traits in triticale forage.

BMC Plant Biol. 2024 Mar 27;24(1):223. doi: 10.1186/s12870-024-04927-7.

Multifactor dimensionality reduction reveals the effect of interaction between ERAP1 and IFIH1 polymorphisms in psoriasis susceptibility genes.

Front Genet. 2022 Nov 8;13:1009589. doi: 10.3389/fgene.2022.1009589. eCollection 2022.

KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis.

Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac202.

Genomic prediction for fusiform rust disease incidence in a large cloned population of Pinus taeda.

G3 (Bethesda). 2021 Sep 6;11(9). doi: 10.1093/g3journal/jkab235.

Robustification of GWAS to explore effective SNPs addressing the challenges of hidden population stratification and polygenic effects.

Sci Rep. 2021 Jun 22;11(1):13060. doi: 10.1038/s41598-021-90774-7.

Machine Learning Model for Predicting Postoperative Survival of Patients with Colorectal Cancer.

Cancer Res Treat. 2022 Apr;54(2):517-524. doi: 10.4143/crt.2021.206. Epub 2021 Jun 15.

Identification of epistasis loci underlying rice flowering time by controlling population stratification and polygenic effect.

DNA Res. 2019 Apr 1;26(2):119-130. doi: 10.1093/dnares/dsy043.

Oceanographic variation influences spatial genomic structure in the sea scallop, .

Ecol Evol. 2018 Feb 11;8(5):2824-2841. doi: 10.1002/ece3.3846. eCollection 2018 Mar.
