Institute of Computer Science, University of Tartu, Tartu, Estonia.
Institute of Genomics, University of Tartu, Tartu, Estonia; Institute of Mathematics and Statistics, University of Tartu, Tartu, Estonia.
HGG Adv. 2024 Oct 10;5(4):100348. doi: 10.1016/j.xhgg.2024.100348. Epub 2024 Aug 29.
Identifying the causal genes underlying genome-wide association study (GWAS) signals is a fundamental problem in human genetics. Although colocalization with gene expression quantitative trait loci (eQTLs) is often used to prioritize GWAS target genes, systematic benchmarking has been limited by the lack of large ground-truth datasets. Here, we re-analyzed plasma protein QTL (pQTL) data from 3,301 individuals of the INTERVAL cohort together with 131 eQTL Catalog datasets. By focusing on variants located within or close to the gene encoding the affected protein, we identified 793 proteins with at least one cis-pQTL for which we could assume that the most likely causal gene was the gene coding for the protein. We then benchmarked the ability of cis-eQTLs to recover these causal genes by comparing three Bayesian colocalization methods (coloc.susie, coloc.abf, and CLPP) and five Mendelian randomization (MR) approaches (three varieties of inverse-variance-weighted MR, MR-RAPS, and MRLocus). We found that assigning fine-mapped pQTLs to their closest protein-coding genes outperformed all colocalization methods in both precision (71.9%) and recall (76.9%). Moreover, the colocalization method with the highest recall (coloc.susie, 46.3%) also had the lowest precision (45.1%). Combining evidence from multiple conditionally distinct colocalizing QTLs with MR increased precision to 81%, but at the cost of a large reduction in recall, to 7.1%. The choice of MR method also greatly affected performance, with the standard inverse-variance-weighted MR often producing many false positives. Our results highlight that linking GWAS variants to target genes remains challenging with eQTL evidence alone, and that prioritizing novel targets requires triangulation of evidence from multiple sources.
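To make the two benchmark metrics concrete, the following minimal sketch shows one plausible way to compute precision and recall for a gene-prioritization method against a set of assumed-causal genes. The exact definitions used in the study are not given in the abstract, so this is an assumption: precision is taken as the share of predicted locus-gene pairs in which the gene is the true one, and recall as the share of loci for which the true gene is among the predictions. The function name, locus labels, and gene labels are hypothetical toy values, not data from the paper.

```python
def precision_recall(predictions, truth):
    """Compute precision and recall for gene prioritization.

    predictions: dict mapping a locus ID to the set of genes a method predicts.
    truth: dict mapping a locus ID to its single assumed-causal gene.
    Definitions here are an assumption, not taken from the paper.
    """
    # Total number of locus-gene pairs the method proposed.
    n_predicted = sum(len(genes) for genes in predictions.values())
    # Loci where the assumed-causal gene is among the predicted genes.
    n_correct = sum(
        1 for locus, genes in predictions.items() if truth.get(locus) in genes
    )
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy benchmark: four loci, each with one assumed-causal gene.
truth = {"locus_A": "GENE1", "locus_B": "GENE2", "locus_C": "GENE3", "locus_D": "GENE4"}

# Hypothetical method output: no prediction at locus_D, a wrong gene at locus_C,
# and one extra (false-positive) gene at locus_B.
predictions = {
    "locus_A": {"GENE1"},
    "locus_B": {"GENE2", "GENE9"},
    "locus_C": {"GENE8"},
}

precision, recall = precision_recall(predictions, truth)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under these definitions, a method that reports additional genes per locus can raise recall while lowering precision, which is the kind of trade-off the abstract describes between the colocalization methods and the combined colocalization-plus-MR approach.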