Konopka Tomasz, Vestito Letizia, Smedley Damian
William Harvey Research Institute, Queen Mary University of London, EC1M 6BQ London, UK.
Ear Institute, University College London, WC1X 8EE London, UK.
Bioinform Adv. 2021 Oct 11;1(1):vbab026. doi: 10.1093/bioadv/vbab026. eCollection 2021.
Animal models have long been used to study gene function and the impact of genetic mutations on phenotype. Through the efforts of thousands of research groups, systematic curation of published literature and high-throughput phenotyping screens, the collective body of knowledge for the mouse now covers the majority of protein-coding genes. Here, we collected data for over 53 000 mouse models with mutations in over 15 000 genomic markers, characterized by more than 254 000 annotations using more than 9000 distinct ontology terms. We investigated dimensional reduction and embedding techniques as a means to facilitate access to this diverse and high-dimensional information. Our analyses provide the first visual maps of the landscape of mouse phenotypic diversity. We also summarize some of the difficulties in producing and interpreting embeddings of sparse phenotypic data. In particular, we show that data preprocessing, filtering and encoding have as much impact on the final embeddings as the process of dimensional reduction itself. Nonetheless, techniques developed in the context of dimensional reduction create opportunities for exploratory analysis of this large pool of public data, including searching for mouse models suited to the study of human diseases.
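The point that encoding can matter as much as the reduction step can be illustrated with a minimal sketch. It is not taken from the authors' analysis scripts: the toy annotations, the model and term names, the rarity weighting and the helper functions are all hypothetical, and plain PCA (via SVD) stands in for whichever embedding method is actually used.

```python
import numpy as np

# Hypothetical toy annotations as (model, ontology term) pairs; not real data.
ANNOTATIONS = [
    ("model_a", "term_1"), ("model_a", "term_2"),
    ("model_b", "term_1"), ("model_b", "term_2"), ("model_b", "term_3"),
    ("model_c", "term_4"),
    ("model_d", "term_4"), ("model_d", "term_5"),
]


def encode_binary(annotations):
    """Build a models-by-terms 0/1 matrix from (model, term) pairs."""
    models = sorted({m for m, _ in annotations})
    terms = sorted({t for _, t in annotations})
    row = {m: i for i, m in enumerate(models)}
    col = {t: j for j, t in enumerate(terms)}
    matrix = np.zeros((len(models), len(terms)))
    for m, t in annotations:
        matrix[row[m], col[t]] = 1.0
    return models, terms, matrix


def weight_by_rarity(matrix):
    """Alternative encoding (an assumption for illustration):
    scale each term column by log(n_models / n_models_with_term)."""
    counts = matrix.sum(axis=0)  # every term occurs at least once
    return matrix * np.log(matrix.shape[0] / counts)


def pca_embed(matrix, n_components=2):
    """Project rows onto the top principal components using SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def pairwise_distances(points):
    """Euclidean distances between all pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


models, terms, binary = encode_binary(ANNOTATIONS)
weighted = weight_by_rarity(binary)

# Same reduction step, two encodings: compare the resulting layouts.
dist_binary = pairwise_distances(pca_embed(binary))
dist_weighted = pairwise_distances(pca_embed(weighted))

print("models:", models)
print("binary encoding:\n", np.round(dist_binary, 2))
print("rarity-weighted encoding:\n", np.round(dist_weighted, 2))
```

Comparing the two distance matrices shows how the relative placement of the same models can shift when only the encoding changes, while the reduction step is held fixed.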
Source code for analysis scripts is available on GitHub at https://github.com/tkonopka/mouse-embeddings. The data underlying this article are available in Zenodo at https://doi.org/10.5281/zenodo.4916171.
Supplementary data are available at Bioinformatics Advances online.