University of Oxford, Oxford, United Kingdom.
MRC Harwell Institute, Harwell, United Kingdom.
PLoS Biol. 2022 Aug 9;20(8):e3001723. doi: 10.1371/journal.pbio.3001723. eCollection 2022 Aug.
The function of the majority of genes in the human and mouse genomes is unknown. Investigating and illuminating this dark genome is a major challenge for the biomedical sciences. The International Mouse Phenotyping Consortium (IMPC) is addressing this through the generation and broad-based phenotyping of a knockout (KO) mouse line for every protein-coding gene, producing a multidimensional data set that underlies a genome-wide annotation map from genes to phenotypes. Here, we develop a multivariate (MV) statistical approach and apply it to IMPC data comprising 148 phenotypes measured across 4,548 KO lines. There are 4,256 (1.4% of 302,997 observed data measurements) hits called by the univariate (UV) model analysing each phenotype separately, compared to 31,843 (10.5%) hits in the observed data results of the MV model, corresponding to an estimated 7.5-fold increase in power of the MV model relative to the UV model. One key property of the data set is its 55.0% rate of missingness, resulting from quality control filters and incomplete measurement of some KO lines. This raises the question of whether it is possible to infer perturbations at phenotype-gene pairs at which data are not available, i.e., to infer some in vivo effects using statistical analysis rather than experimentation. We demonstrate that, even at missing phenotypes, the MV model can detect perturbations with power comparable to the single-phenotype analysis, thereby filling in the complete gene-phenotype map with good sensitivity. A factor analysis of the MV model's fitted covariance structure identifies 20 clusters of phenotypes, with each cluster tending to be perturbed collectively. These factors cumulatively explain 75% of the KO-induced variation in the data and facilitate biological interpretation of perturbations. We also demonstrate that the MV approach strengthens the correspondence between IMPC phenotypes and existing gene annotation databases. Analysis of a subset of KO lines measured in replicate across multiple laboratories confirms that the MV model increases power with high replicability.
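The headline percentages quoted above can be cross-checked from the counts given in the abstract. The following is a minimal sketch, not the authors' analysis: the variable names are illustrative, and it assumes that the full phenotype-by-line grid contains 148 × 4,548 cells and that the missingness rate is the share of those cells without an observed measurement. The ratio of MV to UV hit counts happens to agree with the quoted 7.5-fold figure, although the paper's own power estimate may be derived differently.

```python
# Counts quoted in the abstract.
n_phenotypes = 148
n_ko_lines = 4_548
n_observed = 302_997   # observed data measurements
uv_hits = 4_256        # hits called by the univariate (UV) model
mv_hits = 31_843       # hits called by the multivariate (MV) model


def pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Hit rates among observed measurements.
uv_hit_rate = pct(uv_hits, n_observed)
mv_hit_rate = pct(mv_hits, n_observed)

# Simple ratio of hit counts (assumption: used here as a proxy for the
# quoted fold increase in power).
hit_ratio = mv_hits / uv_hits

# Assumption: the complete gene-phenotype grid is phenotypes x KO lines.
n_cells = n_phenotypes * n_ko_lines
missing_rate = pct(n_cells - n_observed, n_cells)

print(f"UV hit rate:  {uv_hit_rate:.1f}%")
print(f"MV hit rate:  {mv_hit_rate:.1f}%")
print(f"MV/UV hits:   {hit_ratio:.1f}-fold")
print(f"Missing rate: {missing_rate:.1f}%")
```

Under these assumptions the four printed values round to the figures stated in the abstract, which suggests the quoted counts and percentages are mutually consistent.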