Khodaee Farhan, Zandie Rohola, Edelman Elazer R
Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, 02139, MA, USA.
Department of Medicine (Cardiovascular Medicine), Brigham and Women's Hospital, Boston, 02115, MA, USA.
Res Sq. 2024 May 16:rs.3.rs-4355413. doi: 10.21203/rs.3.rs-4355413/v1.
How complex phenotypes emerge from intricate gene expression patterns is a fundamental question in biology. Quantitative characterization of this relationship, however, is challenging because of the vast combinatorial possibilities and the dynamic interplay between genotype and phenotype landscapes. Integrating high-content genotyping approaches such as single-cell RNA sequencing with advanced learning methods such as language models offers an opportunity to dissect this complex relationship. Here, we present a computational integrated genetics framework designed to analyze and interpret the high-dimensional landscape of genotypes and their associated phenotypes simultaneously. We applied this approach to develop a multimodal foundation model that explores the genotype-phenotype relationship manifold for human transcriptomics at the cellular level. Analysis of this joint manifold refined the resolution of cellular heterogeneity, improved the precision of phenotype annotation, and uncovered potential cross-tissue biomarkers that are undetectable through conventional gene expression analysis alone. Moreover, our results revealed that gene networks are characterized by scale-free patterns and show context-dependent gene-gene interactions, both of which result in significant variations in the topology of the gene network that are particularly evident during aging. Finally, using contextualized embeddings, we investigated gene polyfunctionality, the multifaceted roles that genes play in different biological processes, and demonstrated it for the VWF gene in endothelial cells. Overall, this study advances our understanding of the dynamic interplay between gene expression and phenotypic manifestation and demonstrates the potential of integrated genetics to uncover new dimensions of cellular function and complexity.