Department of Molecular Physiology & Biophysics, Nashville, TN, USA.
Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, USA.
Bioinformatics. 2019 May 1;35(9):1453-1460. doi: 10.1093/bioinformatics/bty826.
Given the complexity of genomic regions, prioritizing the functional effects of non-coding variants remains a challenge. Although several frameworks have been proposed for evaluating the functionality of non-coding variants, most of them use 'black-box' methods that reduce the task to a pathogenic/benign classification problem, which ignores the distinct regulatory mechanisms of variants and leads to less desirable performance. In this study, we developed DVAR, an unsupervised framework that leverages various biochemical and evolutionary evidence to distinguish the gene regulatory categories of variants and assess their comprehensive functional impact simultaneously.
DVAR performed de novo pattern discovery in high-dimensional data and identified five regulatory clusters of non-coding variants. Leveraging these new insights into the multiple functional patterns, it measures both the between-class and the within-class functional implications of the variants to achieve accurate prioritization. Compared to other two-class learning methods, it showed improved performance in identifying clinically significant variants, fine-mapped GWAS variants, eQTLs and expression-modulating variants. Moreover, it showed superior performance on disease-causal variants verified by genome editing (such as CRISPR-Cas9), which could provide a pre-selection strategy for genome-editing technologies across the whole genome. Finally, evaluated in BioVU and UK Biobank, two large-scale DNA biobanks linked to complete electronic health records, DVAR demonstrated its effectiveness in prioritizing non-coding variants associated with medical phenotypes.
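The idea of combining a between-class and a within-class measure can be illustrated with a minimal sketch. This is not DVAR's algorithm, which the abstract does not specify: plain k-means stands in for the unsupervised clustering step, the three-dimensional annotation vectors are synthetic, and the scoring rule (cluster rank plus within-cluster percentile) is an assumption made purely for illustration. All function names are hypothetical.

```python
import math
import random


def make_toy_variants(n_per_group=20, seed=1):
    """Synthetic annotation vectors: five well-separated groups (an assumption)."""
    rng = random.Random(seed)
    centers = [(0, 0, 0), (5, 0, 0), (0, 5, 0), (0, 0, 5), (5, 5, 5)]
    return [
        [c + rng.uniform(-0.5, 0.5) for c in center]
        for center in centers
        for _ in range(n_per_group)
    ]


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def prioritize(points, labels, centroids):
    """Hypothetical score: between-cluster rank plus within-cluster percentile."""
    # Between-class term: rank clusters by total centroid signal (0 = weakest).
    order = sorted(range(len(centroids)), key=lambda c: sum(centroids[c]))
    cluster_rank = {c: r for r, c in enumerate(order)}
    scores = []
    for p, l in zip(points, labels):
        peers = [sum(q) for q, m in zip(points, labels) if m == l]
        # Within-class term: share of same-cluster variants with signal <= this one.
        within = sum(s <= sum(p) for s in peers) / len(peers)
        scores.append(cluster_rank[l] + within)
    return scores


variants = make_toy_variants()
labels, centroids = kmeans(variants, k=5)
scores = prioritize(variants, labels, centroids)
top = max(range(len(variants)), key=scores.__getitem__)
print(f"top variant index={top}, cluster={labels[top]}, score={scores[top]:.2f}")
```

In this toy scoring rule the within-cluster term never exceeds 1, so a variant in a higher-ranked cluster always outscores one in a lower-ranked cluster, and the percentile only orders variants inside the same cluster. Whether DVAR weights the two components this way is not stated in the abstract.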
The C++ and Python source code, together with the pre-computed DVAR-cluster labels and DVAR-scores across the whole genome, are available at https://www.vumc.org/cgg/dvar.
Supplementary data are available at Bioinformatics online.