Zavorskas Joseph, Edwards Harley, Marten Mark R, Harris Steven, Srivastava Ranjan
Department of Chemical and Biomolecular Engineering, University of Connecticut, Storrs, Connecticut 06269, United States.
Department of Chemical, Biochemical, and Environmental Engineering, University of Maryland, Baltimore County, Baltimore, Maryland 21250, United States.
ACS Omega. 2024 Sep 30;9(40):41208-41216. doi: 10.1021/acsomega.4c01704. eCollection 2024 Oct 8.
We present an application of computational inverse design, which reverses the conventional trial-and-error forward design paradigm and optimizes biological phenotype by directly modifying genotype. The limitations of inverse design in genotype-to-bulk phenotype (G-BP) mapping can be addressed via an established design paradigm: "design, build, test, learn" (DBTL), where computational inverse design automates both the design and learn phases. In any context, inverse design is limited by the fundamental "one-to-many" nature of the inverse function. G-BP inverse design is further limited by the number of single nucleotide polymorphisms that can be introduced into a member of the population while maintaining feasibility of genotype creation and biological viability. Considering these limitations, we propose a design paradigm based on incremental optimization of phenotype through a combined computational and experimental approach. We intend this work to be a foundational synthesis of well-known techniques applied to genotype-to-bulk phenotype inverse design, an application that has not yet been reported in the literature. The design pipeline can optimize phenotype either by directly proposing genotypic changes or simply by suggesting parents to be used for selective breeding. The soybean nested association mapping data set is used to present an in silico case study of the design pipeline, in which optimization maximizes protein content while constraining other phenotypes. A random forest (RF) is used to model the genotype-to-phenotype relationship, and a genetic algorithm is used to query the RF until a feasible genotype with the desired phenotype is discovered. After 20 in silico DBTL cycles, the pipeline suggests a final population of individuals with a mean protein content of 36.13%, an increase of three standard deviations above the original mean.
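The search step described in the abstract, in which a genetic algorithm queries a surrogate model under a limit on the number of SNP changes, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the binary genotype encoding, the per-locus effect sizes, and the constraint threshold are all hypothetical, and a simple additive function stands in for the fitted random forest so that the example needs only the Python standard library.

```python
import random

def snp_distance(a, b):
    """Number of loci at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def make_child(mother, father, parent, max_snps, rng):
    """Uniform crossover plus one point mutation; fall back to the mother
    if the child would exceed the SNP budget relative to the parent."""
    child = [rng.choice(pair) for pair in zip(mother, father)]
    locus = rng.randrange(len(child))
    child[locus] = 1 - child[locus]
    return child if snp_distance(child, parent) <= max_snps else list(mother)

def inverse_design(parent, predict, feasible, max_snps=3,
                   pop_size=20, generations=30, seed=0):
    """Search for a genotype near `parent` that maximizes `predict`.
    Assumes `parent` itself satisfies `feasible`."""
    rng = random.Random(seed)
    population = [list(parent) for _ in range(pop_size)]
    best = list(parent)
    for _ in range(generations):
        children = [
            make_child(rng.choice(population), rng.choice(population),
                       parent, max_snps, rng)
            for _ in range(pop_size)
        ]
        # Keep only genotypes that satisfy the constraint on other phenotypes.
        candidates = [g for g in population + children if feasible(g)]
        if not candidates:
            continue
        # Truncation selection: retain the highest-scoring genotypes.
        candidates.sort(key=predict, reverse=True)
        population = candidates[:pop_size]
        if predict(population[0]) > predict(best):
            best = list(population[0])
    return best

# Hypothetical per-locus effects; a real pipeline would call the trained model.
TARGET_EFFECTS = [0.5, -0.2, 0.8, 0.1, -0.4, 0.3]
OTHER_EFFECTS = [0.1, 0.0, 0.6, 0.0, 0.0, 0.2]

def predict_target(genotype):
    """Stand-in surrogate for the phenotype being maximized."""
    return sum(e * g for e, g in zip(TARGET_EFFECTS, genotype))

def other_trait_ok(genotype):
    """Stand-in constraint: a second phenotype must stay below a threshold."""
    return sum(e * g for e, g in zip(OTHER_EFFECTS, genotype)) <= 0.7

parent = [0, 0, 0, 0, 0, 0]
best = inverse_design(parent, predict_target, other_trait_ok, max_snps=3)
print(best, predict_target(best))
```

In a full DBTL loop, the suggested genotypes would be built and measured, the surrogate retrained on the new data, and the search repeated. The SNP budget is enforced here by rejecting children that drift too far from the starting genotype, which is one simple way to keep each cycle's proposals incremental.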