Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, 695 Charles E. Young Drive South, Los Angeles, CA, 90095-176, USA.
Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA.
Genome Biol. 2022 Oct 24;23(1):225. doi: 10.1186/s13059-022-02793-w.
DNA methylation (DNAm)-based predictors hold great promise to serve as clinical tools for health interventions and disease management. While these algorithms often have high prediction accuracy, the consistency of their performance remains to be determined. We therefore conduct a systematic evaluation across 101 different DNAm data preprocessing and normalization strategies and assess how each analytical strategy affects the consistency of 41 DNAm-based predictors.
Our analyses are conducted in a large EPIC DNAm array dataset from the Jackson Heart Study (N = 2053) that includes 146 pairs of technical replicate samples. By estimating the average absolute agreement between replicate pairs, we show that 32 out of 41 predictors (78%) demonstrate excellent consistency when appropriate data processing and normalization steps are implemented. Across all pairs of predictors, we find a moderate correlation in performance across analytical strategies (mean rho = 0.40, SD = 0.27), highlighting significant heterogeneity in performance across algorithms. Whether technical variation is successfully removed also significantly affects downstream phenotypic association analyses, such as associations with all-cause mortality risk.
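To make "absolute agreement between replicate pairs" concrete, the sketch below computes a two-way, single-measurement intraclass correlation coefficient for absolute agreement, often written ICC(2,1) or ICC(A,1). This is an assumption about how such agreement could be quantified, not a statement of the exact estimator used in the study. The function name `icc_absolute_agreement` and the example values are hypothetical.

```python
import numpy as np


def icc_absolute_agreement(scores):
    """Two-way, single-measurement ICC for absolute agreement (ICC(2,1)).

    `scores` is an (n_samples, k_replicates) array, e.g. one predictor's
    output for each technical replicate pair (k = 2). Illustrative only;
    the study's exact estimator is not specified here.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    row_means = x.mean(axis=1)   # per-sample means
    col_means = x.mean(axis=0)   # per-replicate (measurement) means

    # Two-way ANOVA sums of squares
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Systematic offsets between replicates (ms_cols) lower the estimate,
    # which is what distinguishes absolute agreement from consistency.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )


# Made-up predictor outputs for four replicate pairs (not study data)
example_pairs = np.array([
    [52.1, 53.0],
    [61.4, 60.2],
    [47.8, 49.1],
    [70.3, 69.5],
])
print(round(icc_absolute_agreement(example_pairs), 3))
```

Under this framing, one such value would be computed per predictor and per preprocessing strategy, so that strategies can be ranked by how well each predictor reproduces across replicates.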
We show that DNAm-based algorithms are sensitive to technical variation. Choosing an appropriate data processing strategy is important to achieve reproducible estimates and improve prediction accuracy in downstream phenotypic association analyses. For each of the 41 DNAm predictors, we report its degree of consistency and provide the best-performing analytical strategy as a guideline for the research community. As DNAm-based predictors become increasingly widely used, our work helps improve their performance and standardize their implementation.