Ecology, Evolution, and Marine Biology, University of California, Santa Barbara, California 93106, USA.
Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae073.
Predicting phenotypes from genetic variation is foundational for fields as diverse as bioengineering and global change biology, highlighting the importance of efficient methods to predict gene functions. Linking genetic changes to phenotypic changes has been a goal of decades of experimental work, especially for some model gene families, including light-sensitive opsin proteins. Opsins can be expressed in vitro to measure light absorption parameters, including λmax (the wavelength of maximum absorbance), which strongly affects organismal phenotypes like color vision. Despite extensive research on opsins, the data remain dispersed, uncompiled, and often challenging to access, thereby precluding systematic and comprehensive analyses of the intricate relationships between genotype and phenotype.
Here, we report a newly compiled database of all heterologously expressed opsin genes with λmax phenotypes that we call the Visual Physiology Opsin Database (VPOD). VPOD_1.0 contains 864 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 73 separate publications. We use VPOD data and deepBreaks to show that regression-based machine learning (ML) models often reliably predict λmax, account for nonadditive effects of mutations on function, and identify functionally critical amino acid sites.
The ability to reliably predict functions from gene sequences alone using ML will allow robust exploration of molecular-evolutionary patterns governing phenotype, will inform functional and evolutionary connections to an organism's ecological niche, and may be used more broadly for de novo protein design. Together, our database, phenotype predictions, and model comparisons lay the groundwork for future research applicable to families of genes with quantifiable and comparable phenotypes.
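To make the idea of regression-based λmax prediction concrete, the following is a minimal sketch, not the VPOD or deepBreaks pipeline. It assumes aligned sequences of equal length, one-hot encodes each site, and fits a plain ridge regression by gradient descent. The four-residue sequences and λmax values are invented for illustration and are not VPOD data, and the helper names (`one_hot`, `fit_ridge`, `predict`) are hypothetical. A linear model like this one is purely additive, so it does not capture the nonadditive effects that the models in the study are reported to handle.

```python
# Illustrative sketch only: toy data, hypothetical helper names.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten an aligned sequence into 20 indicator features per site."""
    return [1.0 if residue == aa else 0.0 for residue in seq for aa in AMINO_ACIDS]


def predict(w, b, x):
    """Linear prediction: intercept plus the weighted sum of site features."""
    return b + sum(wj * xj for wj, xj in zip(w, x))


def fit_ridge(X, y, alpha=0.01, lr=0.05, epochs=2000):
    """Ridge regression (L2 penalty on weights, not on the intercept)
    fitted by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, sum(y) / n
    for _ in range(epochs):
        residuals = [predict(w, b, x) - yi for x, yi in zip(X, y)]
        grad_b = sum(residuals) / n
        grad_w = [
            sum(r * x[j] for r, x in zip(residuals, X)) / n + alpha * w[j]
            for j in range(d)
        ]
        b -= lr * grad_b
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return w, b


# Made-up aligned "opsin" fragments and lambda-max values in nm (NOT real data).
toy_data = {
    "ASTY": 500.0,
    "ASTF": 490.0,
    "AATY": 495.0,
    "AATF": 485.0,
    "GSTY": 510.0,
    "GSTF": 500.0,
}

X = [one_hot(seq) for seq in toy_data]
y = list(toy_data.values())
w, b = fit_ridge(X, y)

# Predict a residue combination that does not appear in the training set.
held_out = "GATF"
lambda_max_pred = predict(w, b, one_hot(held_out))
print(f"Predicted lambda-max for {held_out}: {lambda_max_pred:.1f} nm")
```

Because the toy values were constructed to be additive across sites, the prediction for the unseen combination lands close to 495 nm. Real opsin data would need a proper alignment, held-out validation, and a model class able to represent interactions between sites.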