Srivatsa Arjun, Schwartz Russell
Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States.
Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, United States.
Bioinform Adv. 2024 Dec 2;4(1):vbae193. doi: 10.1093/bioadv/vbae193. eCollection 2024.
Genomic biotechnology has rapidly advanced, allowing for the inference and modification of genetic and epigenetic information at the single-cell level. While these tools hold enormous potential for basic and clinical research, they also raise difficult issues of how to design studies to deploy them most effectively. In designing a genomic study, a modern researcher might combine many sequencing modalities and sampling protocols, each with different utility, costs, and other tradeoffs. This is especially relevant for studies of somatic variation, which may involve highly heterogeneous cell populations whose differences can be probed with an extensive set of biotechnological tools. Efficiently deploying genomic technologies in this space will require principled ways to create study designs that recover desired genomic information while minimizing various measures of cost.
The central problem this paper addresses is how to create an optimal study design for a genomic analysis, with particular focus on studies of somatic variation, which arise most often in cancer genomics. We pose the study design problem as a stochastic constrained nonlinear optimization problem. We introduce a Bayesian optimization framework that iteratively optimizes an objective function using surrogate modeling combined with pattern and gradient search. We apply our procedure to several test cases to derive resource and study design allocations optimized for various goals and criteria, demonstrating its ability to optimize study designs efficiently across diverse scenarios.
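As a rough illustration of the general idea of surrogate-based optimization of a noisy objective, the following minimal sketch may help. It is not the authors' implementation: it swaps in a plain Gaussian-process surrogate with a lower-confidence-bound rule evaluated on a grid, in place of the pattern and gradient search described above. The one-dimensional design variable `x` (imagined here as the fraction of a budget given to one sequencing modality), the toy loss `noisy_loss`, and all parameter values are hypothetical assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of design points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=0.02):
    """Posterior mean and standard deviation of a zero-mean GP surrogate."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise ** 2 * np.eye(len(x_obs))
    k_no = rbf_kernel(x_new, x_obs)
    mean = k_no @ np.linalg.solve(k_oo, y_obs)
    # Diagonal of k_no @ inv(k_oo) @ k_no.T, subtracted from the prior variance of 1.
    var = 1.0 - np.sum(k_no * np.linalg.solve(k_oo, k_no.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def noisy_loss(x, rng):
    """Hypothetical stochastic objective: a toy loss whose mean is smallest at x = 0.3."""
    return (x - 0.3) ** 2 + rng.normal(scale=0.02)

def bayes_opt(n_iter=25, seed=0):
    """Iteratively fit the surrogate, pick the next design, and evaluate it."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)  # feasible set: bound constraint 0 <= x <= 1
    x_obs = np.array([0.0, 0.5, 1.0])  # initial designs
    y_obs = np.array([noisy_loss(x, rng) for x in x_obs])
    for _ in range(n_iter):
        offset = y_obs.mean()  # center observations for the zero-mean GP
        mean, std = gp_posterior(x_obs, y_obs - offset, grid)
        # Lower confidence bound: favor low predicted loss or high uncertainty.
        x_next = grid[np.argmin(mean + offset - 2.0 * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, noisy_loss(x_next, rng))
    mean, _ = gp_posterior(x_obs, y_obs - y_obs.mean(), grid)
    return grid[np.argmin(mean)]  # design with the lowest predicted loss

if __name__ == "__main__":
    print("estimated best design:", bayes_opt())
```

In a real study design setting the design vector would be multidimensional (for example, numbers of samples, cells, and sequencing depth per modality), and budget constraints would restrict the feasible set rather than a simple interval.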