Kaplun Alexander, Krull Mathias, Lakshman Karthick, Matys Volker, Lewicki Birgit, Hogan Jennifer D
QIAGEN Bioinformatics, 35 Gatehouse Drive, Waltham, MA, 02451, USA.
BMC Genomics. 2016 Jun 23;17 Suppl 2(Suppl 2):393. doi: 10.1186/s12864-016-2724-0.
The regulatory effect of inherited or de novo genetic variants occurring in promoters, as well as in transcribed or even coding gene regions, is gaining recognition as a contributing factor to disease processes, in addition to mutations affecting protein function. Thousands of such regulatory mutations are already recorded in HGMD, OMIM, ClinVar and other databases of published disease-causing and disease-associated mutations. It is therefore important to properly annotate genetic variants occurring in experimentally verified and predicted transcription factor binding sites (TFBS), since such variants could influence factor binding. The choice of promoter sequence is an important factor in the analysis because it directly determines the sequence available for transcription factor binding analysis.
In this study we first establish genomic regions likely to be involved in regulation of gene expression. TRANSFAC uses a method of virtual transcription start site (vTSS) calculation to define the best supported promoter for a gene. To test and validate this approach, we compared the virtually calculated promoters, best supported versus secondary, in the hg19 and hg38 reference genomes. Next, we created and applied a workflow for systematic analysis of causal disease-associated variants in TFBSs using the Genome Trax and TRANSFAC databases. A total of 841 and 736 experimentally verified TFBSs within best supported promoters were mapped over HGMD and ClinVar mutation sites, respectively. Tens of thousands of predicted, ChIP-Seq derived TFBSs were mapped over mutations as well. We further analyzed some of these mutations for potential gain or loss of transcription factor binding.
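The two analysis steps described above, mapping binding sites over mutation positions and checking for gain or loss of binding, can be illustrated with a minimal sketch. It is not the Genome Trax or TRANSFAC workflow itself and does not reproduce TRANSFAC's scoring: the `Site` record, the function names, the 1-based inclusive coordinates, and the generic log-odds position weight matrix with a uniform 0.25 background are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Site(NamedTuple):
    """Hypothetical TFBS record; coordinates are 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    factor: str


def map_variants_to_sites(sites, variants):
    """Return {(chrom, pos): [factor, ...]} for variants falling inside a site."""
    by_chrom = defaultdict(list)
    for site in sites:
        by_chrom[site.chrom].append(site)
    hits = {}
    for chrom, pos in variants:
        # Linear scan per chromosome; adequate for a sketch, not for genome scale.
        factors = [s.factor for s in by_chrom.get(chrom, []) if s.start <= pos <= s.end]
        if factors:
            hits[(chrom, pos)] = factors
    return hits


def pwm_score(seq, pwm, background=0.25):
    """Log-odds score of seq under a PWM given as one {base: probability} dict per position."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))


def binding_delta(ref_seq, alt_seq, pwm):
    """Score change caused by a variant: negative suggests loss, positive suggests gain."""
    return pwm_score(alt_seq, pwm) - pwm_score(ref_seq, pwm)


# Toy data, invented for illustration only.
toy_sites = [Site("chr1", 100, 110, "TF_A"), Site("chr2", 50, 60, "TF_B")]
toy_variants = [("chr1", 105), ("chr1", 200), ("chr2", 50)]
toy_hits = map_variants_to_sites(toy_sites, toy_variants)

# Two-position toy matrix that prefers the sequence "AC".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
toy_delta = binding_delta("AC", "GC", toy_pwm)
```

In this toy run, the variants at chr1:105 and chr2:50 fall inside a site while chr1:200 does not, and replacing the preferred A with G lowers the score, which would be read as a potential loss of binding. A real analysis would need an indexed interval structure, strand handling, and thresholds calibrated for each matrix.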
We have confirmed the validity of TRANSFAC's approach to defining the best supported promoters and established a workflow for their use in the annotation of regulatory genetic variants.