Casas Alexis, Bultelle Matthieu, Motraghi Charles, Kitney Richard
Department of Bioengineering, Imperial College London, London, United Kingdom.
Front Bioeng Biotechnol. 2022 Jan 10;9:785131. doi: 10.3389/fbioe.2021.785131. eCollection 2021.
We present a software tool, cMatch, that reconstructs and identifies synthetic genetic constructs from their sequences, or from a set of sub-sequences, based on two practical pieces of information: their modular structure and libraries of components. Although cMatch was developed for combinatorial pathway engineering problems and addresses their quality control (QC) bottleneck, it is not restricted to these applications. QC takes place after assembly, transformation and growth. It has a simple goal: to verify that the genetic material contained in a cell matches what was intended to be built and, when it does not, to locate the discrepancies and estimate their severity. The QC step is crucial for reproducibility and reliability, and failure at this step requires repeating the construction and/or sequencing steps. When performed manually or semi-manually, QC is an extremely time-consuming, error-prone process that scales very poorly with the number of constructs and their complexity. To make QC frictionless and more reliable, cMatch automates an operation we call "construct-matching". Construct-matching is more thorough than simple sequence-matching, as it matches at the functional level and quantifies the match both at the individual component level and across the whole construct. Two algorithms, CM_1 and CM_2, are presented. They differ in the nature of their inputs. CM_1 is the core algorithm for construct-matching and is to be used when input sequences are long enough to cover constructs in their entirety (e.g., obtained with methods such as next generation sequencing). CM_2 is an extension designed to deal with shorter reads (e.g., obtained with Sanger sequencing) that need recombining. Both algorithms are shown to yield accurate construct-matching in a few minutes, even on hardware with limited processing power, together with a set of metrics that can be used to improve the robustness of the decision-making process. To ensure reliability and reproducibility, cMatch builds on the highly validated Smith-Waterman pairwise-matching algorithm. All the tests presented were conducted on synthetic data for challenging yet realistic constructs, and on real data gathered during studies on a metabolic engineering example (lycopene production).
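To make the idea of construct-matching concrete, the following is a minimal Python sketch, not the cMatch implementation. It assumes a construct template is an ordered list of slots, that each slot has a small library of candidate components, and that each candidate is scored against the full-length read with a plain Smith-Waterman local alignment. The function names (`smith_waterman_score`, `match_construct`), the scoring parameters (match +2, mismatch -1, linear gap -1), the normalisation by the maximum attainable score, and the toy sequences are all illustrative assumptions. The sketch ignores reverse complements, component order along the read, and the recombination of short reads that CM_2 addresses.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score of query against target (linear gap penalty)."""
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # Local alignment: scores are floored at zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def match_construct(sequence, template, library, match=2):
    """For each slot in the template, pick the library component that aligns best.

    Scores are normalised to [0, 1] by the score of a perfect full-length match.
    Returns the per-slot results and the weakest per-slot score as a crude
    whole-construct metric (an assumption of this sketch).
    """
    results = []
    for slot in template:
        scored = {
            name: smith_waterman_score(comp, sequence, match=match) / (match * len(comp))
            for name, comp in library[slot].items()
        }
        best_name = max(scored, key=scored.get)
        results.append((slot, best_name, scored[best_name]))
    overall = min(score for _, _, score in results)
    return results, overall


# Toy data (hypothetical sequences, chosen only for illustration).
library = {
    "promoter": {"P1": "TTGACAGCTAGC", "P2": "AAACCCGGGTTT"},
    "gene": {"G1": "ATGGCTAGCAAAGGA", "G2": "ATGTCCCGTTTACGA"},
}
template = ["promoter", "gene"]
sequence = "CC" + "TTGACAGCTAGC" + "GG" + "ATGTCCCGTTTACGA" + "TT"

results, overall = match_construct(sequence, template, library)
for slot, name, score in results:
    print(f"{slot}: {name} (normalised score {score:.2f})")
print(f"whole construct: {overall:.2f}")
```

Reporting a score per component as well as one for the whole construct mirrors the two levels of quantification described in the abstract. Taking the minimum across slots is only one possible choice of whole-construct metric.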