Hong Mun-Gwan, Pawitan Yudi, Magnusson Patrik K E, Prince Jonathan A
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
Hum Genet. 2009 Aug;126(2):289-301. doi: 10.1007/s00439-009-0676-z. Epub 2009 May 1.
A fundamental question in human genetics is the degree to which the polygenic character of complex traits derives from polymorphism in genes with similar or with dissimilar functions. The many genome-wide association studies now being performed offer an opportunity to investigate this, and although early attempts are emerging, new tools and modeling strategies still need to be developed and deployed. Towards this goal, we implemented a new algorithm to facilitate the transition from genetic marker lists (principally those generated by PLINK) to pathway analyses of representational gene sets in either threshold or threshold-free downstream applications (e.g. DAVID, GSEA-P, and Ingenuity Pathway Analysis). The algorithm was applied to several large genome-wide association studies covering diverse human traits, including type 2 diabetes, Crohn's disease, and plasma lipid levels. The approach was validated for plasma HDL levels, where functional categories related to lipid metabolism emerged as the most significant in two independent studies. From analyses of these samples, we highlight and address numerous issues related to this strategy, including appropriate gene-based correction statistics, the utility of imputed versus non-imputed marker sets, and the apparent enrichment of pathways due solely to the positional clustering of functionally related genes. The last of these in particular emphasizes the importance of studies that directly tie genetic variation to functional characteristics of specific genes. The freely provided software, which we have called ProxyGeneLD, may resolve an important bottleneck in pathway-based analyses of genome-wide association data. It has allowed us to identify at least one replicable case of pathway enrichment, but also to highlight functional gene clustering as a potentially serious problem that may lead to spurious pathway findings if not corrected.
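To illustrate the general idea of moving from a marker list to a ranked gene list with a gene-based correction, here is a minimal Python sketch. It is not the ProxyGeneLD implementation, and the abstract does not specify the correction used. The sketch assumes a simple Bonferroni-style adjustment of each gene's smallest marker p-value by the number of markers mapped to that gene, and it ignores linkage disequilibrium between markers. The function names, the input dictionaries, and the marker and gene identifiers are all hypothetical.

```python
from collections import defaultdict


def gene_level_pvalues(snp_pvalues, snp_to_genes):
    """Collapse marker-level p-values into one corrected p-value per gene.

    Assumption (not taken from the abstract): each gene is represented by
    its smallest marker p-value multiplied by the number of markers mapped
    to it, capped at 1.0. Markers are treated as independent here.
    """
    per_gene = defaultdict(list)
    for snp, p in snp_pvalues.items():
        # A marker may map to zero, one, or several genes.
        for gene in snp_to_genes.get(snp, ()):
            per_gene[gene].append(p)

    corrected = {}
    for gene, pvals in per_gene.items():
        # Genes with many markers get a stronger penalty.
        corrected[gene] = min(1.0, min(pvals) * len(pvals))
    return corrected


def rank_genes(gene_pvalues):
    """Return gene names ordered from most to least significant."""
    return sorted(gene_pvalues, key=lambda g: (gene_pvalues[g], g))


# Hypothetical marker-level results and marker-to-gene mapping.
snp_pvalues = {"rs1": 0.001, "rs2": 0.04, "rs3": 0.2, "rs4": 0.01}
snp_to_genes = {
    "rs1": ["GENE_A"],
    "rs2": ["GENE_A"],
    "rs3": ["GENE_A", "GENE_B"],
    "rs4": ["GENE_B"],
}

gene_p = gene_level_pvalues(snp_pvalues, snp_to_genes)
ranked = rank_genes(gene_p)
```

A ranked list like this could feed a threshold-free tool, while a cut-off on the corrected p-values would give a gene set for a threshold-based tool. Without some correction of this kind, genes covered by many markers would tend to rank highly by chance alone.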