Department of Applied Mathematics, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China.
Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China.
Bioinformatics. 2019 May 15;35(10):1729-1736. doi: 10.1093/bioinformatics/bty870.
A large number of recent genome-wide association studies (GWASs) of complex phenotypes have confirmed the early conjecture of polygenicity, suggesting the presence of a large number of variants with only tiny or moderate effects. However, because the sample size of a single GWAS is limited, many associated genetic variants have effects too weak to reach genome-wide significance. These undiscovered variants further limit the prediction capability of GWAS. Restricted access to individual-level data and the increasing availability of published GWAS results motivate the development of methods that integrate both individual-level and summary-level data. How the connection between the two types of data is modeled determines how efficiently the abundant existing summary-level resources can be used alongside limited individual-level data, and this issue motivates further work in this area.
In this study, we propose a statistical approach, LEP, which provides a novel way of modeling the connection between individual-level data and summary-level data. LEP integrates both types of data by LEveraging Pleiotropy to increase the statistical power for identifying risk variants and the accuracy of risk prediction. We developed a parameter estimation algorithm that handles genome-wide-scale data. Through comprehensive simulation studies, we demonstrate the advantages of LEP over existing methods. We further applied LEP to an integrative analysis of Crohn's disease data from WTCCC together with summary statistics from GWASs of other diseases, such as type 1 diabetes, ulcerative colitis and primary biliary cirrhosis. LEP significantly increased the statistical power for identifying risk variants and improved the risk prediction accuracy from 63.39% (±0.58%) to 68.33% (±0.32%) using about 195 000 variants.
The LEP software is available at https://github.com/daviddaigithub/LEP.
Supplementary data are available at Bioinformatics online.