Purcell Shaun, Neale Benjamin, Todd-Brown Kathe, Thomas Lori, Ferreira Manuel A R, Bender David, Maller Julian, Sklar Pamela, de Bakker Paul I W, Daly Mark J, Sham Pak C
Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA.
Am J Hum Genet. 2007 Sep;81(3):559-75. doi: 10.1086/519795. Epub 2007 Jul 25.
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
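As an illustration of the identity-by-state information the abstract refers to, the following is a minimal sketch of a pairwise IBS proportion between two individuals. It is not PLINK's implementation. It assumes genotypes are coded as minor-allele counts (0, 1, 2) with `None` for missing calls, and the function name `ibs_proportion` is hypothetical. At each marker the pair shares 2 - |g1 - g2| alleles identical by state, and the proportion is the total number of shared alleles divided by twice the number of markers at which both genotypes are present.

```python
from typing import Optional, Sequence


def ibs_proportion(
    a: Sequence[Optional[int]], b: Sequence[Optional[int]]
) -> float:
    """Proportion of alleles shared identical by state between two individuals.

    Assumption (not from the source): genotypes are minor-allele counts
    0, 1, or 2, and None marks a missing call.
    """
    if len(a) != len(b):
        raise ValueError("genotype vectors must cover the same markers")

    shared = 0    # total alleles shared IBS across compared markers
    compared = 0  # markers where both individuals have a genotype call

    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip markers with a missing call in either individual
        shared += 2 - abs(ga - gb)  # 2, 1, or 0 alleles shared at this marker
        compared += 1

    if compared == 0:
        raise ValueError("no markers with genotype calls in both individuals")

    return shared / (2 * compared)


if __name__ == "__main__":
    ind1 = [0, 1, 2, 2, None]
    ind2 = [0, 2, 0, 2, 1]
    print(ibs_proportion(ind1, ind2))
```

Computing this value for every pair of individuals gives a genome-wide similarity matrix of the kind that can feed stratification detection. Estimating identity by descent from it requires allele frequencies and a further model, which this sketch does not attempt.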