Stanek David, Bis-Brewer Dana M, Saghira Cima, Danzi Matt C, Seeman Pavel, Lassuthova Petra, Zuchner Stephan
DNA Laboratory, Department of Paediatric Neurology, 2nd Faculty of Medicine, Charles University in Prague and University Hospital Motol, Prague, V Úvalu 84, 150 06 Czech Republic.
Department of Human Genetics and John P. Hussman Institute for Human Genomics, Miller School of Medicine, University of Miami, Miami, FL 33136, USA.
Database (Oxford). 2020 Jan 1;2020. doi: 10.1093/database/baz161.
Genetic variation within conserved functional protein domains warrants special attention when DNA variants are assessed for a causal role in disease. Here we introduce a resource, freely available at www.prot2hg.com, that reports whether a particular variant falls within an annotated protein domain and directly translates chromosomal coordinates into protein residues. The tool supports simple multiple-site queries, and the whole dataset is available for download and is also incorporated into our own accessible pipeline. To create this resource, National Center for Biotechnology Information protein data were retrieved using the Entrez Programming Utilities. After processing all human protein domains, residue positions were reverse translated, mapped to the reference genome hg19 and stored in a MySQL database. In total, 760 487 protein domains from 42 371 protein models were mapped to hg19 coordinates and made publicly available for search or download (www.prot2hg.com). In addition, this annotation was implemented in the genomics research platform GENESIS in order to query nearly 8000 exomes and genomes of families with rare Mendelian disorders (tgp-foundation.org). When applied to patient genetic data, we found that rare (<1%) variants in the Genome Aggregation Database were significantly more often annotated onto a protein domain than common (>1%) variants. Similarly, variants described as pathogenic or likely pathogenic in ClinVar were more likely to be annotated onto a domain. In addition, we tested a dataset of 60 causal variants from a cohort of patients with epileptic encephalopathy and found that 71% of them (43 variants) mapped onto protein domains. In summary, we developed a resource that annotates variants in the coding part of the genome onto conserved protein domains in order to increase the efficiency of variant prioritization. Database URL: www.prot2hg.com.
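The reverse-translation step described above, from domain residue positions to genomic coordinates, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `residues_to_genomic` and `position_in_domain`, the exon coordinates and the 1-based inclusive interval convention are all assumptions chosen for illustration. It assumes the coding exons of a protein model are already known and contain only coding bases.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic coordinates (assumed convention)


def residues_to_genomic(cds_exons: List[Interval], strand: str,
                        aa_start: int, aa_end: int) -> List[Interval]:
    """Map a protein residue range onto genomic intervals.

    cds_exons holds the coding part of each exon; strand is "+" or "-".
    Hypothetical helper, not part of the prot2hg code base.
    """
    # 0-based offsets into the coding sequence covered by the residue range.
    cds_from = (aa_start - 1) * 3
    cds_to = aa_end * 3 - 1

    # Visit exons in transcript order: ascending on "+", descending on "-".
    ordered = sorted(cds_exons, reverse=(strand == "-"))

    out: List[Interval] = []
    offset = 0  # coding offset of the first base of the current exon
    for start, end in ordered:
        length = end - start + 1
        lo = max(cds_from, offset)
        hi = min(cds_to, offset + length - 1)
        if lo <= hi:  # the domain overlaps this exon
            if strand == "+":
                out.append((start + (lo - offset), start + (hi - offset)))
            else:
                out.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    return sorted(out)


def position_in_domain(pos: int, intervals: List[Interval]) -> bool:
    """Return True if a genomic position lies inside any domain interval."""
    return any(s <= pos <= e for s, e in intervals)


# Toy protein model: two coding exons, 15 coding bases, 5 residues.
exons = [(100, 105), (200, 208)]
domain = residues_to_genomic(exons, "+", aa_start=2, aa_end=3)
print(domain)
print(position_in_domain(201, domain), position_in_domain(204, domain))
```

A domain that spans an exon boundary yields more than one genomic interval, so a variant lookup has to test each interval separately rather than a single start and end pair.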