A quality control algorithm for filtering SNPs in genome-wide association studies.

Author information

Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695-7566, USA.

Publication information

Bioinformatics. 2010 Jul 15;26(14):1731-7. doi: 10.1093/bioinformatics/btq272. Epub 2010 May 25.

DOI:10.1093/bioinformatics/btq272

PMID:20501555

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2894516/

Abstract

MOTIVATION

The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step in genome-wide association studies to minimize potential false findings. SNP QC commonly uses expert-guided filters based on QC variables [e.g. Hardy-Weinberg equilibrium, missing proportion (MSP) and minor allele frequency (MAF)] to remove SNPs with insufficient genotyping quality. The rationale of the expert filters is sensible and concrete, but its implementation requires arbitrary thresholds and does not jointly consider all QC features.
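To make the limitation concrete, here is a minimal sketch of a fixed-threshold expert filter. The function name `expert_filter` and the cutoff values (`hwe_min`, `msp_max`, `maf_min`) are illustrative assumptions, not values taken from the paper. Each QC variable is tested independently against its own cutoff, so a SNP that sits just inside all three thresholds passes no matter how the variables look in combination.

```python
import numpy as np

def expert_filter(hwe_p, msp, maf, hwe_min=1e-6, msp_max=0.05, maf_min=0.01):
    """Return a boolean mask: True where a SNP passes every fixed threshold.

    The default cutoffs are hypothetical placeholders; real studies pick
    their own values, which is exactly the arbitrariness discussed above.
    """
    hwe_p = np.asarray(hwe_p, dtype=float)
    msp = np.asarray(msp, dtype=float)
    maf = np.asarray(maf, dtype=float)
    # Each criterion is applied on its own; no joint structure is used.
    return (hwe_p >= hwe_min) & (msp <= msp_max) & (maf >= maf_min)

# Toy QC table for four hypothetical SNPs.
hwe_p = [0.50, 1e-9, 0.30, 0.80]   # Hardy-Weinberg test p-values
msp = [0.01, 0.02, 0.10, 0.049]    # missing proportions
maf = [0.20, 0.15, 0.30, 0.005]    # minor allele frequencies
keep = expert_filter(hwe_p, msp, maf)
```

In this toy table only the first SNP is kept; the other three each fail exactly one criterion.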

RESULTS

We propose an algorithm based on principal component analysis and clustering analysis to identify low-quality SNPs. The method minimizes the use of arbitrary cutoff values, allows a collective consideration of the QC features and provides conditional thresholds contingent on other QC variables (e.g. different MSP thresholds for different MAFs). We applied our method to the seven studies from the Wellcome Trust Case Control Consortium and the major depressive disorder study from the Genetic Association Information Network. We compared the performance of our method with that of the expert filters on the following criteria: (i) percentage of SNPs excluded due to low quality; (ii) inflation factor of the test statistics (lambda); (iii) number of false associations found in the filtered dataset; and (iv) number of true associations missed in the filtered dataset. The results suggest that, with the same or fewer SNPs excluded, the proposed algorithm tends to give a similar or lower value of lambda and a reduced number of false associations while retaining all true associations.
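The abstract does not spell out the algorithm's steps, so the following is only a rough sketch of the general idea: project standardized QC features onto principal components, cluster the SNPs in that space, and treat one cluster as low quality. Everything specific here is an assumption made for illustration: the feature set (-log10 HWE p-value, MSP, MAF), the use of two-cluster k-means, the rule that the smaller cluster is the low-quality one, and the helper names. The `inflation_factor` helper uses the common genomic-control definition of lambda (median 1-df chi-square statistic divided by about 0.4549); whether the paper computes lambda this way is also an assumption.

```python
import numpy as np

def pca_scores(features, n_components=2):
    """Standardize each column, then project onto the top principal components."""
    x = np.asarray(features, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Rows of vt are the principal directions of the standardized data.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T

def two_means(points, n_iter=100):
    """Plain Lloyd's k-means with k=2, seeded at the two extremes of PC1."""
    centers = points[[points[:, 0].argmin(), points[:, 0].argmax()]].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def flag_low_quality(features):
    """Flag the smaller cluster as low quality (an assumption of this sketch)."""
    labels = two_means(pca_scores(features))
    minority = np.bincount(labels, minlength=2).argmin()
    return labels == minority

def inflation_factor(chisq_1df):
    """Genomic-control lambda: median statistic over the chi-square(1) median."""
    return float(np.median(chisq_1df) / 0.4549364)

# Synthetic example: 200 well-behaved SNPs followed by 10 poor-quality ones.
rng = np.random.default_rng(0)
n_good, n_bad = 200, 10
neglog_hwe = np.concatenate([rng.uniform(0, 2, n_good), rng.uniform(8, 12, n_bad)])
msp = np.concatenate([rng.uniform(0, 0.02, n_good), rng.uniform(0.10, 0.20, n_bad)])
maf = rng.uniform(0.05, 0.5, n_good + n_bad)
flagged = flag_low_quality(np.column_stack([neglog_hwe, msp, maf]))
```

Because the clusters are found in the joint feature space, the boundary between kept and removed SNPs is not a single cutoff per variable, which is how a method of this kind can yield thresholds on one QC variable that depend on the others.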

AVAILABILITY

The algorithm is available at http://www4.stat.ncsu.edu/jytzeng/software.php
