Zelenka Natalie R, Di Cara Nina, Sharma Kieren, Sarvaharman Seeralan, Ghataora Jasdeep S, Parmeggiani Fabio, Nivala Jeff, Abdallah Zahraa S, Marucci Lucia, Gorochowski Thomas E
Jean Golding Institute, University of Bristol, Bristol, UK.
BrisEngBio, University of Bristol, Bristol, UK.
Synth Biol (Oxf). 2024 Jun 21;9(1):ysae010. doi: 10.1093/synbio/ysae010. eCollection 2024.
Data science is playing an increasingly important role in the design and analysis of engineered biology. This has been fueled by the development of high-throughput methods like massively parallel reporter assays, data-rich microscopy techniques, computational protein structure prediction and design, and the development of whole-cell models able to generate huge volumes of data. Although the ability to apply data-centric analyses in these contexts is appealing and increasingly simple to do, it comes with potential risks. For example, how might biases in the underlying data affect the validity of a result, and what might the environmental impact of large-scale data analyses be? Here, we present a community-developed framework for assessing data hazards to help address these concerns and demonstrate its application to two synthetic biology case studies. We show the diversity of considerations that arise in common types of bioengineering projects and provide some guidelines and mitigating steps. Understanding potential issues and dangers when working with data and proactively addressing them will be essential for ensuring the appropriate use of emerging data-intensive AI methods and for increasing the trustworthiness of their applications in synthetic biology.