Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
Nat Commun. 2020 Sep 17;11(1):4703. doi: 10.1038/s41467-020-18515-4.
Deep learning models have shown great promise in predicting regulatory effects from DNA sequence, but their informativeness for human complex diseases is not fully understood. Here, we evaluate genome-wide SNP annotations from two previous deep learning models, DeepSEA and Basenji, by applying stratified LD score regression to 41 diseases and traits (average N = 320K), conditioning on a broad set of coding, conserved and regulatory annotations. We aggregated annotations across all (respectively blood or brain) tissues/cell-types in meta-analyses across all (respectively 11 blood or 8 brain) traits. The annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific and brain-specific Basenji-H3K4me3 for all traits and brain traits respectively. We conclude that deep learning models have yet to achieve their full potential to provide considerable unique information for complex disease, and that their conditional informativeness for disease cannot be inferred from their accuracy in predicting regulatory annotations.
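As a rough illustration of what "enriched for disease heritability" means here, the sketch below computes enrichment for a binary annotation as the share of heritability it explains divided by the share of SNPs it contains. This is the standard definition used with stratified LD score regression for binary annotations. The deep learning annotations evaluated in the paper may be handled differently, so this should be read as a simplified assumption. The function name and all numbers are hypothetical and are not taken from the study.

```python
def heritability_enrichment(h2_annot, h2_total, n_snps_annot, n_snps_total):
    """Enrichment of a binary annotation (illustrative helper, not from the paper).

    enrichment = (proportion of heritability in the annotation)
                 / (proportion of SNPs in the annotation)
    """
    if h2_total <= 0 or n_snps_total <= 0 or n_snps_annot <= 0:
        raise ValueError("totals and annotation size must be positive")
    prop_h2 = h2_annot / h2_total            # share of SNP-heritability explained
    prop_snps = n_snps_annot / n_snps_total  # share of SNPs covered
    return prop_h2 / prop_snps


# Toy numbers (hypothetical): an annotation covering 2% of SNPs
# that explains 20% of the SNP-heritability.
example = heritability_enrichment(
    h2_annot=0.06, h2_total=0.30,
    n_snps_annot=100_000, n_snps_total=5_000_000,
)
print(f"enrichment = {example:.1f}")
```

A value above 1 indicates enrichment. Note that enrichment alone does not imply conditional significance: an annotation can be highly enriched simply because it overlaps other annotations already in the model, which is the distinction the abstract draws.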