Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695-7566, USA.
Bioinformatics. 2010 Jul 15;26(14):1731-7. doi: 10.1093/bioinformatics/btq272. Epub 2010 May 25.
The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step in genome-wide association studies to minimize potential false findings. SNP QC commonly uses expert-guided filters based on QC variables [e.g. Hardy-Weinberg equilibrium, missing proportion (MSP) and minor allele frequency (MAF)] to remove SNPs with insufficient genotyping quality. The rationale of the expert filters is sensible and concrete, but their implementation requires arbitrary thresholds, and the filters do not jointly consider all QC features.
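To make the limitation concrete, the following is a minimal sketch of an expert-style filter in which each QC variable is checked against its own fixed cutoff. The threshold values, the `SnpQc` record, the `expert_filter` function and the SNP identifiers are all illustrative assumptions, not values or names taken from the study.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """QC summary for one SNP (hypothetical record layout)."""
    snp_id: str
    hwe_p: float  # p-value of the Hardy-Weinberg equilibrium test
    msp: float    # missing proportion
    maf: float    # minor allele frequency


def expert_filter(snps, hwe_min=1e-4, msp_max=0.05, maf_min=0.01):
    """Keep SNPs that pass every cutoff; each variable is judged on its own.

    The default cutoffs are assumed for illustration only.
    """
    return [
        s for s in snps
        if s.hwe_p >= hwe_min and s.msp <= msp_max and s.maf >= maf_min
    ]


# Made-up SNPs: one clean, and one failing each of the three cutoffs.
snps = [
    SnpQc("rs_a", hwe_p=0.40, msp=0.01, maf=0.25),
    SnpQc("rs_b", hwe_p=1e-6, msp=0.01, maf=0.30),   # fails HWE
    SnpQc("rs_c", hwe_p=0.20, msp=0.08, maf=0.15),   # fails MSP
    SnpQc("rs_d", hwe_p=0.60, msp=0.02, maf=0.005),  # fails MAF
]

kept = [s.snp_id for s in expert_filter(snps)]
print(kept)
```

Because each condition is evaluated independently, the MSP cutoff is the same whatever the MAF of the SNP, which is the kind of unconditional threshold the abstract describes.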
We propose an algorithm that is based on principal component analysis and clustering analysis to identify low-quality SNPs. The method minimizes the use of arbitrary cutoff values, allows a collective consideration of the QC features and provides conditional thresholds contingent on other QC variables (e.g. different MSP thresholds for different MAFs). We applied our method to the seven studies from the Wellcome Trust Case Control Consortium and the major depressive disorder study from the Genetic Association Information Network. We measured the performance of our method against the expert filters using the following criteria: (i) percentage of SNPs excluded due to low quality; (ii) inflation factor of the test statistics (lambda); (iii) number of false associations found in the filtered dataset; and (iv) number of true associations missed in the filtered dataset. The results suggest that, with the same or fewer SNPs excluded, the proposed algorithm tends to give a similar or lower value of lambda and fewer false associations while retaining all true associations.
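The general idea of combining principal component analysis with clustering over QC variables can be sketched as follows. This is not the authors' algorithm, whose details the abstract does not give. It is a toy illustration that assumes three QC features per SNP (-log10 of the HWE p-value, MSP, MAF), uses only the first principal component (found by power iteration), splits the scores with a one-dimensional 2-means, and assumes that the smaller cluster holds the low-quality SNPs. All function names and data values are hypothetical.

```python
import math
import statistics


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def first_component(z, iters=200):
    """Leading eigenvector of the covariance of z, by power iteration."""
    n, p = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def two_means_1d(scores, iters=50):
    """Split 1-D scores into two clusters, starting from the min and max."""
    lo, hi = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - lo) <= abs(s - hi) else 1 for s in scores]
        g0 = [s for s, lab in zip(scores, labels) if lab == 0]
        g1 = [s for s, lab in zip(scores, labels) if lab == 1]
        if not g0 or not g1:
            break
        new_lo, new_hi = statistics.fmean(g0), statistics.fmean(g1)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return labels


# Made-up QC values: (-log10 HWE p-value, MSP, MAF) per SNP.
qc = {
    "rs_g1": (0.3, 0.005, 0.30),
    "rs_g2": (0.5, 0.010, 0.22),
    "rs_g3": (0.2, 0.008, 0.41),
    "rs_g4": (0.7, 0.012, 0.15),
    "rs_g5": (0.4, 0.006, 0.35),
    "rs_g6": (0.1, 0.009, 0.28),
    "rs_b1": (6.0, 0.090, 0.03),
    "rs_b2": (7.5, 0.120, 0.02),
}

ids = list(qc)
z = standardize([list(qc[i]) for i in ids])
v = first_component(z)
scores = [sum(a * b for a, b in zip(row, v)) for row in z]
labels = two_means_1d(scores)

# Assumption: low-quality SNPs are the minority cluster.
minority = min((0, 1), key=labels.count)
flagged = [i for i, lab in zip(ids, labels) if lab == minority]
print(flagged)
```

In this sketch the split is made on a combination of all three features rather than on one cutoff per feature, so the MSP level at which a SNP is flagged depends on its MAF and HWE values. A practical implementation would likely retain more than one component and use a more robust clustering method.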
The algorithm is available at http://www4.stat.ncsu.edu/jytzeng/software.php