Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1057, New York, NY 10029, USA.
Clin Epigenetics. 2018 Jun 1;10:73. doi: 10.1186/s13148-018-0504-1. eCollection 2018.
Mislabeled, contaminated or poorly performing samples can threaten power in methylation microarray analyses or even result in spurious associations. We describe a set of quality checks for the popular Illumina 450K and EPIC microarrays to identify problematic samples and demonstrate their application in publicly available datasets.
Quality checks implemented here include 17 control metrics defined by the manufacturer, a sex check to detect mislabeled sex-discordant samples, and both an identity check for fingerprinting sample donors and a measure of sample contamination based on probes querying high-frequency SNPs. These checks were tested on 80 datasets comprising 8327 samples run on the 450K microarray from the GEO repository.
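As an illustration of the sex check idea, the sketch below compares recorded sex against a prediction from sex-chromosome signal. It is a minimal sketch under stated assumptions, not the method of the accompanying software package: it assumes each sample has already been summarized as X- and Y-chromosome probe intensities normalized by autosomal intensity, and the function names and cutoff values (`x_cut`, `y_cut`) are hypothetical.

```python
from typing import List, Optional, Sequence


def predict_sex(x_norm: float, y_norm: float,
                x_cut: float = 0.9, y_cut: float = 0.5) -> Optional[str]:
    """Predict sex from normalized X and Y intensities.

    The cutoffs are hypothetical placeholders; in practice they would be
    chosen per dataset. Returns "f", "m", or None when the two signals
    disagree and no confident call can be made.
    """
    if x_norm >= x_cut and y_norm < y_cut:
        return "f"  # high X signal, low Y signal
    if x_norm < x_cut and y_norm >= y_cut:
        return "m"  # lower X signal, high Y signal
    return None     # ambiguous sample, left uncalled


def flag_sex_mismatches(recorded: Sequence[str],
                        x_norm: Sequence[float],
                        y_norm: Sequence[float]) -> List[int]:
    """Return indices of samples whose confident prediction
    contradicts the recorded sex."""
    flagged = []
    for i, (rec, x, y) in enumerate(zip(recorded, x_norm, y_norm)):
        pred = predict_sex(x, y)
        if pred is not None and pred != rec:
            flagged.append(i)
    return flagged


# Toy data (made up for illustration): the third sample is recorded
# as female but shows a male-like intensity pattern.
recorded = ["f", "m", "f"]
x_values = [1.0, 0.6, 0.55]
y_values = [0.1, 0.9, 0.95]
mismatches = flag_sex_mismatches(recorded, x_values, y_values)
```

Ambiguous samples are deliberately left uncalled rather than flagged, so that only clear contradictions between metadata and signal are reported as mislabeled.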
Nine hundred forty samples were flagged by at least one control metric, and 133 samples from 20 datasets were assigned the wrong sex. In a dataset in which a subset of samples appears contaminated with a single source of DNA, we demonstrate that our measure based on outliers among SNP probes was strongly correlated (correlation > 0.95) with an independent measure of contamination.
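The intuition behind an outlier-based contamination measure can be sketched as follows. This is a simplified proxy and not the statistic used in the work described here: it assumes that beta values at SNP probes in an uncontaminated sample cluster near 0, 0.5, and 1 (the three genotypes), and it reports the fraction of probes falling outside those clusters. The function name, the cluster centers, and the tolerance `tol` are hypothetical.

```python
from typing import Sequence, Tuple


def snp_outlier_fraction(betas: Sequence[float],
                         centers: Tuple[float, ...] = (0.0, 0.5, 1.0),
                         tol: float = 0.1) -> float:
    """Fraction of SNP-probe beta values farther than `tol` from every
    assumed genotype cluster center. Higher values suggest a mixture
    of DNA from more than one donor."""
    if not betas:
        raise ValueError("betas must contain at least one value")
    n_outliers = sum(
        1 for b in betas
        if min(abs(b - c) for c in centers) > tol
    )
    return n_outliers / len(betas)


# Toy values (made up for illustration).
clean_sample = [0.02, 0.5, 0.97, 0.48]   # all close to a genotype cluster
mixed_sample = [0.2, 0.5, 0.75, 0.98]    # two values between clusters
clean_score = snp_outlier_fraction(clean_sample)
mixed_score = snp_outlier_fraction(mixed_sample)
```

A fixed tolerance is the simplest possible choice; a model-based approach that estimates the cluster positions from the data would be more robust to probe-specific shifts.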
A more complete examination of samples that may be mislabeled, contaminated, or have poor performance due to technical problems will improve downstream analyses and replication of findings. We demonstrate that quality control problems are prevalent in a public repository of DNA methylation data. We advocate for a more thorough quality control workflow in epigenome-wide association studies and provide a software package to perform the checks described in this work. Reproducible code and supplementary material are available at 10.5281/zenodo.1172730.