Toneyan Shushan, Tang Ziqi, Koo Peter K
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
Nat Mach Intell. 2022 Dec;4(12):1088-1100. doi: 10.1038/s42256-022-00570-9. Epub 2022 Dec 5.
Deep learning has been successful at predicting epigenomic profiles from DNA sequences. Most approaches frame this task as binary classification, relying on peak callers to define functional activity. Recently, quantitative models have emerged that directly predict experimental coverage values as a regression task. As new models continue to emerge with different architectures and training configurations, a major bottleneck is forming: there is no fair way to assess the novelty of proposed models or their utility for downstream biological discovery. Here we introduce a unified evaluation framework and use it to compare various binary and quantitative models trained to predict chromatin accessibility data. We highlight modeling choices that affect generalization performance, including performance on the downstream task of predicting variant effects. In addition, we introduce a robustness metric that can be used to enhance model selection and improve variant effect predictions. Our empirical study largely supports that quantitative modeling of epigenomic profiles leads to better generalizability and interpretability.
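The contrast between the binary and quantitative framings can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (`one_hot`, `binary_target`, `quantitative_target`), the mean-coverage threshold standing in for a peak caller, and the toy coverage values are all assumptions made for illustration.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as an (L, 4) array in A, C, G, T order.
    Any other character (e.g. N) becomes an all-zero row."""
    alphabet = "ACGT"
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = alphabet.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

def binary_target(coverage, threshold):
    """Hypothetical stand-in for a peak caller: label the region 1
    if its mean coverage exceeds the threshold, else 0."""
    return int(np.mean(coverage) > threshold)

def quantitative_target(coverage, bin_size):
    """Regression target: mean coverage in non-overlapping bins.
    Trailing positions that do not fill a whole bin are dropped."""
    coverage = np.asarray(coverage, dtype=np.float32)
    n_bins = len(coverage) // bin_size
    return coverage[: n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)

# Toy region: 8 bases with made-up per-base coverage values.
coverage = np.array([0, 0, 2, 4, 8, 8, 4, 2], dtype=np.float32)
x = one_hot("ACGTNACG")                               # model input
y_binary = binary_target(coverage, threshold=3.0)     # classification label
y_quant = quantitative_target(coverage, bin_size=4)   # regression target
```

In this toy case the binary label collapses the region to a single bit, while the quantitative target keeps the shape of the coverage signal at the chosen bin resolution.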