Gurinovich Anastasia, Bae Harold, Farrell John J, Andersen Stacy L, Monti Stefano, Puca Annibale, Atzmon Gil, Barzilai Nir, Perls Thomas T, Sebastiani Paola
Bioinformatics Program, Boston University, Boston, MA, USA.
College of Public Health and Human Sciences, Oregon State University, Corvallis, OR, USA.
Bioinformatics. 2019 Sep 1;35(17):3046-3054. doi: 10.1093/bioinformatics/btz017.
Over the last decade, genome-wide association studies have included increasingly diverse populations. If a genetic variant affects a phenotype differently in different populations, a genome-wide association study applied to the dataset as a whole may fail to detect those differences. Identifying population-specific effects of genetic variants is especially important in studies intended to lead to the development of diagnostic tests or to drug discovery.
In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects' ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype.
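To make the idea of a "statistically different" genetic effect between two subsets concrete, the following is a minimal sketch of a generic heterogeneity test. It is not the PopCluster implementation, which combines logistic regression, principal component analysis and hierarchical clustering. It assumes that two candidate subsets of individuals have already been formed, estimates a variant's log odds ratio in each from allele counts, and compares the two estimates with a Wald z-test. The function names, the 0.5 continuity correction and the example counts are illustrative assumptions, not taken from the paper.

```python
import math


def log_odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (log odds ratio, standard error) from allele counts.

    A 0.5 continuity correction is added to every cell (an assumption of
    this sketch) so that zero counts do not cause division by zero.
    """
    a, b, c, d = (x + 0.5 for x in (case_alt, case_ref, ctrl_alt, ctrl_ref))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


def heterogeneity_test(group1, group2):
    """Wald test for a difference in log odds ratios between two groups.

    Each group is a tuple (case_alt, case_ref, ctrl_alt, ctrl_ref).
    Returns (z statistic, two-sided p-value under a normal approximation).
    """
    b1, se1 = log_odds_ratio(*group1)
    b2, se2 = log_odds_ratio(*group2)
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


if __name__ == "__main__":
    # Hypothetical allele counts: the variant is enriched in cases in
    # subset A but shows no association in subset B.
    subset_a = (120, 80, 80, 120)
    subset_b = (100, 100, 100, 100)
    z, p = heterogeneity_test(subset_a, subset_b)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

In a clustering-based procedure such as the one described above, a comparison of this kind would be applied to pairs of clusters rather than to groups defined by reported ethnicity. The abstract does not specify the exact test PopCluster uses, so this sketch should be read only as an illustration of the general principle.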
PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage.
Supplementary data are available at Bioinformatics online.