A toolkit for enhanced reproducibility of RNASeq analysis for synthetic biologists.

Author information

Garcia Benjamin J, Urrutia Joshua, Zheng George, Becker Diveena, Corbet Carolyn, Maschhoff Paul, Cristofaro Alexander, Gaffney Niall, Vaughn Matthew, Saxena Uma, Chen Yi-Pei, Gordon D Benjamin, Eslami Mohammed

Affiliations

Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA.

Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA.

Publication information

Synth Biol (Oxf). 2022 Aug 23;7(1):ysac012. doi: 10.1093/synbio/ysac012. eCollection 2022.

Abstract

Sequencing technologies, in particular RNASeq, have become critical tools in the design, build, test and learn cycle of synthetic biology. They provide a better understanding of synthetic designs, and they help identify ways to improve and select designs. While these data are beneficial to design, their collection and analysis form a complex, multistep process that has implications for both the discovery and the reproducibility of experiments. Additionally, tool parameters, experimental metadata, normalization of data and standardization of file formats present challenges that are computationally intensive. This calls for high-throughput pipelines expressly designed to handle the combinatorial and longitudinal nature of synthetic biology. In this paper, we present a pipeline to maximize the analytical reproducibility of RNASeq for synthetic biologists. We also explore the impact of reproducibility on the validation of machine learning models. The pipeline combines traditional RNASeq data processing tools with structured metadata tracking so that combinatorial designs can be explored in a high-throughput and reproducible manner. We then demonstrate its utility via two experiments: a control comparison experiment and a machine learning model experiment. The first experiment compares datasets collected from identical biological controls across multiple days for two different organisms. It shows that a reproducible experimental protocol for one organism does not guarantee reproducibility in another. The second experiment quantifies the differences between experimental runs from multiple perspectives. It shows that the lack of reproducibility from these different perspectives can place an upper bound on the validation of machine learning models trained on RNASeq data. Graphical Abstract.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e119/9408027/47dd070e653b/ysac012f2.jpg
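
The abstract describes pairing data processing tools with structured metadata tracking. As a rough illustration of that general idea only (this is not the authors' pipeline, and every field name and value below is a hypothetical placeholder), one way to make an analysis run identifiable is to derive a deterministic fingerprint from its tool parameters and sample metadata:

```python
import hashlib
import json


def run_fingerprint(tool_params: dict, sample_metadata: dict) -> str:
    """Return a deterministic SHA-256 fingerprint for one analysis run."""
    record = {"tool_params": tool_params, "sample_metadata": sample_metadata}
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values for illustration only; none are taken from the paper.
params_a = {"aligner": "example-aligner", "threads": 4, "min_quality": 20}
params_b = {"min_quality": 20, "threads": 4, "aligner": "example-aligner"}
metadata = {"organism": "example-organism", "day": 1, "replicate": "control-1"}

fp_a = run_fingerprint(params_a, metadata)
fp_b = run_fingerprint(params_b, metadata)  # same settings, different key order
fp_c = run_fingerprint({**params_a, "min_quality": 30}, metadata)  # one change

print(fp_a == fp_b)  # identical settings give identical fingerprints
print(fp_a == fp_c)  # changing a parameter changes the fingerprint
```

Storing such a fingerprint next to each output file is one simple way to tell whether two result sets were produced under the same recorded settings. It does not capture tool versions or input data unless those are also added to the record.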
