Wjst Matthias
Gruppe Molekulare Epidemiologie, Institut für Epidemiologie, GSF - Forschungszentrum für Umwelt und Gesundheit, Ingolstädter Landstrasse 1, D-85758 Neuherberg/Munich, Germany.
BMC Bioinformatics. 2004 Jul 12;5:92. doi: 10.1186/1471-2105-5-92.
The massive amount of SNP data stored at public internet sites provides unprecedented access to human genetic variation. Target SNPs for disease-gene association studies are currently selected more or less at random, because decision rules for selecting functionally relevant SNPs are not available.
We implemented a computational pipeline that retrieves the genomic sequence of target genes, collects information about sequence variation, and selects functional motifs containing SNPs. The features considered are gene promoters, exon-intron structure, AU-rich mRNA elements, transcription factor binding motifs, and cryptic and enhancer splice sites, together with expression in the target tissue. As a case study, we selected 396 genes on chromosome 6p21 in the extended HLA region, which together contributed nearly 20,000 SNPs. Computer annotation identified ~2,500 SNPs in functional motifs. Most of these SNPs disrupt transcription factor binding sites, but only those introducing new sites had a significant depressing effect on SNP allele frequency. Other decision rules concern the position of a SNP within a motif, the validity of SNP database entries, unique occurrence in the genome, and conservation of the sequence context in other mammalian genomes.
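The selection step described above can be illustrated with a minimal sketch. This is not the author's implementation: the `Snp` and `Motif` types, the `select_functional_snps` function, the coordinates, and the `rs_demo` identifiers are all hypothetical. The sketch assumes that motif annotation has already produced intervals on the gene sequence, and it applies only three of the decision rules (overlap with a motif, a validated database entry, and unique occurrence in the genome).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rs_id: str              # hypothetical identifier, not a real dbSNP entry
    position: int           # 1-based coordinate on the gene sequence
    validated: bool         # database entry is confirmed
    unique_in_genome: bool  # flanking sequence maps to a single locus


@dataclass(frozen=True)
class Motif:
    kind: str   # e.g. "promoter" or "tf_binding_site"
    start: int  # inclusive
    end: int    # inclusive


def select_functional_snps(snps, motifs):
    """Map each retained SNP id to the kinds of motifs it falls into."""
    selected = {}
    for snp in snps:
        # Decision rules: skip unvalidated or non-unique entries.
        if not (snp.validated and snp.unique_in_genome):
            continue
        kinds = [m.kind for m in motifs if m.start <= snp.position <= m.end]
        if kinds:
            selected[snp.rs_id] = kinds
    return selected


# Illustrative data only.
snps = [
    Snp("rs_demo1", 120, True, True),   # inside two motifs
    Snp("rs_demo2", 450, True, True),   # outside every motif
    Snp("rs_demo3", 130, False, True),  # inside a motif but not validated
]
motifs = [Motif("promoter", 100, 200), Motif("tf_binding_site", 115, 125)]

result = select_functional_snps(snps, motifs)
print(result)
```

A linear scan over motifs is adequate for a handful of genes; at the scale of the case study (nearly 20,000 SNPs), an interval index would likely be a better choice.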
Only 10% of all gene-based SNPs have sequence-predicted functional relevance, which makes them a primary target for genotyping in association studies.