Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO.

Affiliation

Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1057, New York, 10029 NY USA.

Publication information

Clin Epigenetics. 2018 Jun 1;10:73. doi: 10.1186/s13148-018-0504-1. eCollection 2018.

DOI:10.1186/s13148-018-0504-1

PMID:29881472

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5984806/

Abstract

BACKGROUND

Mislabeled, contaminated or poorly performing samples can threaten power in methylation microarray analyses or even result in spurious associations. We describe a set of quality checks for the popular Illumina 450K and EPIC microarrays to identify problematic samples and demonstrate their application in publicly available datasets.

METHODS

Quality checks implemented here include 17 control metrics defined by the manufacturer, a sex check to detect mislabeled sex-discordant samples, and both an identity check for fingerprinting sample donors and a measure of sample contamination based on probes querying high-frequency SNPs. These checks were tested on 80 datasets comprising 8327 samples run on the 450K microarray from the GEO repository.
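As an illustration of the sex check, the following is a minimal sketch and not the published package's implementation. It assumes that each sample provides total signal intensities for autosomal, X-chromosome and Y-chromosome probes; the field names and both cutoffs (`x_cut`, `y_cut`) are hypothetical and would need to be chosen from the data.

```python
from statistics import mean


def normalized_xy(autosomal, chr_x, chr_y):
    """Return mean X and Y probe intensities relative to the autosomal mean."""
    base = mean(autosomal)
    return mean(chr_x) / base, mean(chr_y) / base


def predict_sex(x_norm, y_norm, x_cut=0.9, y_cut=0.5):
    """Predict 'm' or 'f' from normalized intensities.

    The cutoffs are hypothetical placeholders, not values from the paper.
    Returns None when the two signals disagree (ambiguous sample).
    """
    if y_norm > y_cut and x_norm < x_cut:
        return "m"
    if y_norm <= y_cut and x_norm >= x_cut:
        return "f"
    return None


def flag_sex_mismatch(samples):
    """Return ids of samples whose predicted sex contradicts the recorded sex."""
    flagged = []
    for s in samples:
        x_norm, y_norm = normalized_xy(s["autosomal"], s["chr_x"], s["chr_y"])
        predicted = predict_sex(x_norm, y_norm)
        if predicted is not None and predicted != s["recorded_sex"]:
            flagged.append(s["id"])
    return flagged


# Toy data: sample A is recorded as female but shows a male-like profile.
samples = [
    {"id": "A", "recorded_sex": "f",
     "autosomal": [1000, 1000], "chr_x": [600, 600], "chr_y": [800, 800]},
    {"id": "B", "recorded_sex": "f",
     "autosomal": [1000, 1000], "chr_x": [1000, 1000], "chr_y": [100, 100]},
]
mismatched = flag_sex_mismatch(samples)
```

In this sketch, ambiguous samples are left unflagged rather than forced into a category, since an unclear X/Y profile may itself point to contamination or poor performance.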

RESULTS

Nine hundred forty samples were flagged by at least one control metric and 133 samples from 20 datasets were assigned the wrong sex. In a dataset in which a subset of samples appear contaminated with a single source of DNA, we demonstrate that our measure based on outliers among SNP probes was strongly correlated (> 0.95) with another independent measure of contamination.
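The idea behind an outlier-based contamination measure can be sketched as follows. This is a simplified stand-in, not the measure used in the paper: it assumes that beta values of SNP probes in a clean sample cluster near three idealized genotype centers (0, 0.5 and 1), and it reports the fraction of probes that fall outside a hypothetical tolerance `tol` around those centers.

```python
def call_genotype(beta, centers=(0.0, 0.5, 1.0), tol=0.15):
    """Return the index of the nearest genotype cluster, or None for an outlier.

    The centers and the tolerance are idealized assumptions for illustration.
    """
    nearest = min(range(len(centers)), key=lambda i: abs(beta - centers[i]))
    if abs(beta - centers[nearest]) <= tol:
        return nearest
    return None


def outlier_fraction(snp_betas):
    """Fraction of SNP probes whose beta value fits no genotype cluster."""
    if not snp_betas:
        raise ValueError("at least one SNP probe is required")
    calls = [call_genotype(b) for b in snp_betas]
    return sum(c is None for c in calls) / len(calls)


# Toy data: a clean sample and one whose SNP probes drift between clusters,
# as would be expected when DNA from a second donor is mixed in.
clean = [0.02, 0.48, 0.97, 0.51, 0.05, 0.95]
mixed = [0.25, 0.70, 0.97, 0.30, 0.05, 0.75]
clean_score = outlier_fraction(clean)
mixed_score = outlier_fraction(mixed)
```

A higher score indicates more SNP probes with intermediate values. The genotype calls returned by `call_genotype` could also serve as a crude donor fingerprint, since samples from the same donor should agree at nearly all SNP probes.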

CONCLUSIONS

A more complete examination of samples that may be mislabeled, contaminated, or have poor performance due to technical problems will improve downstream analyses and replication of findings. We demonstrate that quality control problems are prevalent in a public repository of DNA methylation data. We advocate for a more thorough quality control workflow in epigenome-wide association studies and provide a software package to perform the checks described in this work. Reproducible code and supplementary material are available at 10.5281/zenodo.1172730.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c35/5984806/80e2e1b5398c/13148_2018_504_Fig1_HTML.jpg
