Mukherjee Sumit, McCaw Zachary R, Pei Jingwen, Merkoulovitch Anna, Soare Tom, Tandon Raghav, Amar David, Somineni Hari, Klein Christoph, Satapati Santhosh, Lloyd David, Probert Christopher, Koller Daphne, O'Dushlaine Colm, Karaletsos Theofanis
Insitro Inc, South San Francisco, California 94080, United States.
Center for Machine Learning, Georgia Institute of Technology, Georgia 30332, United States.
Bioinform Adv. 2024 Sep 17;4(1):vbae135. doi: 10.1093/bioadv/vbae135. eCollection 2024.
Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear whether genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM compares embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean χ² statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) real data from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
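The two evaluation axes described in the abstract can be illustrated with a minimal sketch. This is not the EmbedGEM implementation. It assumes that the per-locus statistic is a 1-degree-of-freedom chi-squared statistic with one value per independent locus, that significance is judged at the conventional threshold of 5e-8, and that disease relevance is summarized as the R² of a least-squares regression of the label on the polygenic risk scores. All function and variable names are hypothetical.

```python
import math
import numpy as np

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold (assumed)


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def heritability_metrics(locus_chi2, alpha=GENOME_WIDE_P):
    """Return (number of significant loci, mean chi-squared at those loci).

    Assumes `locus_chi2` holds one 1-df statistic per independent locus.
    """
    stats = np.asarray(locus_chi2, dtype=float)
    significant = np.array([chi2_1df_pvalue(s) < alpha for s in stats], dtype=bool)
    n_hits = int(significant.sum())
    mean_chi2 = float(stats[significant].mean()) if n_hits else float("nan")
    return n_hits, mean_chi2


def polygenic_scores(genotypes, effect_sizes):
    """One score per embedding principal component.

    genotypes: (n_samples, n_variants) allele counts for held-out samples.
    effect_sizes: (n_variants, n_pcs) per-variant effects estimated elsewhere.
    """
    return np.asarray(genotypes, dtype=float) @ np.asarray(effect_sizes, dtype=float)


def disease_relevance_r2(prs, labels):
    """R^2 of a least-squares fit of the labels on the scores, with intercept.

    Assumes the labels are not all identical.
    """
    y = np.asarray(labels, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(prs, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    centered = y - y.mean()
    return 1.0 - float(residual @ residual) / float(centered @ centered)
```

In this sketch the effect sizes would come from association analyses run on a training cohort, while `genotypes` and `labels` belong to the held-out evaluation set, so the relevance estimate is not inflated by reusing the discovery samples.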