Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples.

Author information

Hong Huixiao, Su Zhenqiang, Ge Weigong, Shi Leming, Perkins Roger, Fang Hong, Xu Joshua, Chen James J, Han Tao, Kaput Jim, Fuscoe James C, Tong Weida

Affiliation

Division of Systems Toxicology, National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.

Publication information

BMC Bioinformatics. 2008 Aug 12;9 Suppl 9(Suppl 9):S17. doi: 10.1186/1471-2105-9-S9-S17.

DOI:10.1186/1471-2105-9-S9-S17

PMID:18793462

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2537568/

Abstract

BACKGROUND

Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling is paramount, since errors introduced by calling algorithms can inflate the number of false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because a GWAS generates hundreds of gigabytes (GB) of raw data, the samples are typically partitioned into batches, each containing a subset of the entire dataset, for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., the number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy, and the propagation of these effects into the lists of significantly associated SNPs, have not been investigated. In this paper, we analyzed the effects of both batch size and batch composition on the genotype calling algorithm BRLMM, using raw data from 270 HapMap samples analyzed with the Affymetrix Human Mapping 500 K array set.
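To make the two factors concrete, here is a minimal Python sketch of how a sample list might be partitioned into batches by size or by composition. It is illustrative only: the function names, the placeholder sample IDs, and the group labels are assumptions, and it does not reproduce the batch designs actually used in the study.

```python
import random


def batches_by_size(sample_ids, batch_size):
    """Split samples into consecutive batches of at most batch_size chips."""
    return [sample_ids[i:i + batch_size]
            for i in range(0, len(sample_ids), batch_size)]


def batches_by_group(sample_ids, group_of):
    """Homogeneous composition: one batch per group label (hypothetical labels)."""
    groups = {}
    for sample in sample_ids:
        groups.setdefault(group_of[sample], []).append(sample)
    return list(groups.values())


def batches_mixed(sample_ids, batch_size, seed=0):
    """Heterogeneous composition: shuffle samples, then cut into fixed-size batches."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    return batches_by_size(shuffled, batch_size)


# Placeholder IDs and group labels, not real HapMap identifiers.
samples = [f"S{i:03d}" for i in range(270)]
group_of = {s: f"G{i // 90}" for i, s in enumerate(samples)}

size_batches = batches_by_size(samples, 90)
group_batches = batches_by_group(samples, group_of)
mixed_batches = batches_mixed(samples, 90)

print([len(b) for b in size_batches])
print([len(b) for b in group_batches])
print([len(b) for b in mixed_batches])
```

Each batch would then be passed to the genotype caller as a separate run, so that the same sample can be called under different batch sizes or compositions and the results compared.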

RESULTS

Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500 K array set, genotypes were called with the BRLMM algorithm under three different batch sizes and three different batch compositions. Comparative analysis of the calling results, and of the corresponding lists of significant SNPs identified through association analysis, revealed that both batch size and batch composition affected the genotype calls and the significantly associated SNPs. The effects of batch size and batch composition were more severe for samples and SNPs with lower call rates than for those with higher call rates, and for heterozygous genotype calls than for homozygous genotype calls.
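A comparison of this kind can be sketched in Python as a call-rate and concordance calculation between two calling runs of the same samples. This is an illustrative sketch, not the authors' analysis pipeline: the genotype encoding ("AA", "AB", "BB", with "NC" for a no-call), the function names, and the toy data are all assumptions.

```python
from collections import Counter

NO_CALL = "NC"  # assumed marker for a missing genotype call


def call_rate(calls):
    """Fraction of genotypes in a run that received a call."""
    return sum(1 for g in calls if g != NO_CALL) / len(calls)


def concordance(calls_a, calls_b):
    """Fraction of identical genotypes among positions called in both runs."""
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a != NO_CALL and b != NO_CALL]
    if not both:
        return float("nan")
    return sum(a == b for a, b in both) / len(both)


def discordant_by_zygosity(calls_a, calls_b):
    """Count discordant calls, grouped by whether run A called a heterozygote."""
    counts = Counter()
    for a, b in zip(calls_a, calls_b):
        if NO_CALL in (a, b) or a == b:
            continue
        counts["het" if a == "AB" else "hom"] += 1
    return counts


# Toy calls for the same six SNPs under two hypothetical batch configurations.
run_a = ["AA", "AB", "BB", "NC", "AB", "AA"]
run_b = ["AA", "AA", "BB", "BB", "AB", "AA"]

print(call_rate(run_a), call_rate(run_b))
print(concordance(run_a, run_b))
print(discordant_by_zygosity(run_a, run_b))
```

Restricting concordance to positions called in both runs keeps missing calls from being counted as disagreements; call rate is reported separately for that reason.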

CONCLUSION

Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the difference in batch size, the larger the effect. The more homogeneous the samples in a batch, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, highly homogeneous samples should be placed in the same batch.
