Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN 46202, USA.
Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA.
Bioinformatics. 2022 Jun 13;38(12):3164-3172. doi: 10.1093/bioinformatics/btac214.
Although genome-wide association studies have identified tens of thousands of variants associated with complex traits, most of which fall within non-coding regions, these variants may not be the causal ones. The development of high-throughput functional assays has led to the discovery of experimentally validated non-coding functional variants. However, such validated variants are rare because of technical difficulty and financial cost. The small sample size of validated variants makes it difficult to develop a reliable supervised machine learning model for genome-wide prediction of non-coding causal variants.
We exploit a deep transfer learning model, based on a convolutional neural network, to improve the prediction of functional non-coding variants (NCVs). To address the challenge of small sample size, the transfer learning model leverages large-scale generic functional NCVs to improve the learning of low-level features, and context-specific functional NCVs to learn high-level features for the context-specific prediction task. By evaluating the deep transfer learning model on three massively parallel reporter assay (MPRA) datasets and 16 GWAS datasets, we demonstrate that the proposed model outperforms deep learning models without pretraining or retraining. In addition, the deep transfer learning model outperforms 18 existing computational methods on both MPRA and GWAS datasets.
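The freeze-and-retrain idea described above can be illustrated with a minimal, dependency-free sketch. It is not the TLVar architecture: the two motif filters, the synthetic sequences, the labelling rules and every function name below are hypothetical, chosen only for illustration. The filters stand in for low-level features that pretraining would learn and are kept frozen, while a logistic output layer is first fitted on a larger generic set and then retrained, from those weights, on a small context-specific set.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 floats, one column per base in BASES."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv_max(x, filt):
    """Slide one filter along the sequence (valid 1D convolution), then take
    the global max, scaled to [0, 1] by the filter length."""
    k = len(filt)
    scores = [
        sum(x[i + j][c] * filt[j][c] for j in range(k) for c in range(4))
        for i in range(len(x) - k + 1)
    ]
    return max(scores) / k

def features(seq, filters):
    """Low-level feature layer: one pooled activation per filter."""
    x = one_hot(seq)
    return [conv_max(x, f) for f in filters]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(xi, w, b):
    """Probability that a feature vector belongs to the functional class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

def train_head(X, y, w, b, lr=0.5, epochs=50):
    """Fit only the output layer by stochastic gradient descent on log-loss."""
    w = list(w)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Hypothetical stand-ins for filters that pretraining would learn (assumption).
MOTIFS = ["TATA", "GCGC"]
filters = [one_hot(m) for m in MOTIFS]

# Stage 1: fit the head on a larger synthetic "generic" set, where a sequence
# is labelled functional if it contains either motif (assumed labelling rule).
rng = random.Random(0)
generic = []
for i in range(200):
    s = [rng.choice(BASES) for _ in range(8)]
    if i % 2 == 0:  # plant a motif in every other sequence
        p = rng.randrange(5)
        s[p:p + 4] = MOTIFS[(i // 2) % 2]
    s = "".join(s)
    generic.append((s, int(any(m in s for m in MOTIFS))))
Xg = [features(s, filters) for s, _ in generic]
yg = [label for _, label in generic]
w0, b0 = train_head(Xg, yg, [0.0, 0.0], 0.0)

# Stage 2: retrain the head on a small "context-specific" set in which only
# the TATA motif is functional; the filters stay frozen.
context = [("AATATAGG", 1), ("TATACCCC", 1), ("GGGCGCAA", 0), ("ACACACAC", 0)]
Xc = [features(s, filters) for s, _ in context]
yc = [label for _, label in context]
w1, b1 = train_head(Xc, yc, w0, b0, epochs=300)

for (s, label), xi in zip(context, Xc):
    print(s, label, round(predict(xi, w1, b1), 3))
```

In this toy setting only the output layer changes between the two stages, so the four context-specific examples are used to adjust two weights and a bias rather than to learn sequence filters from scratch. A real model would learn the filters during pretraining and would typically use a deep learning framework.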
https://github.com/lichen-lab/TLVar.
Supplementary data are available at Bioinformatics online.