Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, Republic of Korea.
Department of Biostatistics, Columbia University, New York, NY, USA.
HGG Adv. 2023 Jul 11;4(4):100223. doi: 10.1016/j.xhgg.2023.100223. eCollection 2023 Oct 12.
Accurate imputation of tissue-specific gene expression can be a powerful tool for understanding the biological mechanisms underlying human complex traits. Existing imputation methods fall into two categories according to the type of predictor used: the first uses genotype data, and the second uses whole-blood expression data. Both data types can be collected easily from blood, avoiding invasive tissue biopsies. In this study, we attempted to build an optimal predictive model for imputing tissue-specific gene expression by combining genotype and whole-blood expression data. We first evaluated the imputation performance of each standalone model, one using genotype data (GEN model) and one using whole-blood expression data (WBE model), across 47 human tissues. The WBE model outperformed the GEN model by a large margin in most tissues. We then developed several combined models that leverage both types of predictors to further improve imputation performance. We tried various strategies, including fitting a model on a merged dataset of the two data types (MERGED models) and integrating the imputation outcomes of the two standalone models (inverse variance-weighted [IVW] models). One of the MERGED models noticeably outperformed the standalone models. This model fixes the ratio between the regularization penalty factors of the two predictor types so that the contribution of the whole-blood transcriptome is upweighted relative to the genotype. Our study suggests that combining genotype and whole-blood expression can improve the imputation of tissue-specific gene expression, but that the improvement depends largely on the combination strategy chosen.
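The two combination strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the regression family, the penalty ratio, or how the IVW weights are estimated. The sketch assumes a lasso-type penalty with per-predictor penalty factors, and a generic inverse variance weighting in which each standalone model has a single error variance. The function names (`weighted_lasso`, `ivw_combine`), the toy data, and the penalty ratio of 4 are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, penalty_factors, n_iter=200):
    """Coordinate descent for the objective
    (1/2n) * sum_i (y_i - x_i . b)^2 + lam * sum_j penalty_factors[j] * |b_j|.
    A larger penalty factor shrinks that predictor more strongly.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero column carries no information
            # Covariance of column j with the residual after adding beta_j back.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * penalty_factors[j]) / z
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_bj
    return beta


def ivw_combine(pred_a, var_a, pred_b, var_b):
    """Average two prediction vectors with weights 1/variance (assumed scalar)."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(pred_a, pred_b)]


# Toy merged design: column 0 stands in for a genotype predictor,
# column 1 for a whole-blood expression predictor (hypothetical values).
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]

# Equal penalty factors: both predictors are shrunk by the same amount.
beta_equal = weighted_lasso(X, y, lam=0.1, penalty_factors=[1.0, 1.0])

# Fixed 4:1 ratio: the genotype column is penalized four times as heavily,
# which upweights the whole-blood expression column by comparison.
beta_ratio = weighted_lasso(X, y, lam=0.1, penalty_factors=[4.0, 1.0])

# IVW: the lower-variance model's predictions get the larger weight.
combined = ivw_combine([1.0, 2.0], 1.0, [3.0, 4.0], 3.0)
```

In this toy design the two predictors are equally informative, so any difference between `beta_equal` and `beta_ratio` comes only from the penalty factors: the genotype coefficient shrinks from about 0.8 to about 0.2 while the expression coefficient stays near 0.8. In practice the penalty ratio and the overall strength `lam` would be tuning choices, for example selected by cross-validation.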