SCP4ssd: A Serverless Platform for Nucleotide Sequence Synthesis Difficulty Prediction Using an AutoML Model.

Affiliations

College of Biotechnology, Tianjin University of Science & Technology, Tianjin 300308, China.

Biodesign Center, Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.

Publication information

Genes (Basel). 2023 Feb 28;14(3):605. doi: 10.3390/genes14030605.

DOI:10.3390/genes14030605

PMID:36980878

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10048150/

Abstract

DNA synthesis is widely used in synthetic biology to construct and assemble sequences ranging from short ribosome binding sites (RBSs) to ultra-long synthetic genomes. Many sequence features, such as GC content and repeat sequences, are known to affect synthesis difficulty and, in turn, synthesis cost. In addition, latent sequence features, especially local characteristics of the sequence, might also affect the DNA synthesis process. Reliable prediction of the synthesis difficulty of a given sequence is important for reducing cost, but it remains a challenge. In this study, we propose a new automated machine learning (AutoML) approach to predict DNA synthesis difficulty, which achieves an F1 score of 0.930 and outperforms the current state-of-the-art model. We found local sequence features, neglected in previous methods, that might also affect the difficulty of DNA synthesis. Moreover, experimental validation based on ten genes of strain MG1655 shows that our model achieves 80% accuracy, which is also better than the state of the art. Finally, we developed the cloud platform SCP4SSD using an entirely cloud-based serverless architecture for the convenience of end users.
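To make the feature types named in the abstract concrete, the following is a minimal Python sketch of how global GC content, a windowed (local) GC profile, a simple repeat measure, and the F1 metric could be computed. It is an illustration only and not the authors' feature set or model: the function names, the window and step sizes, and the use of the longest homopolymer run as a stand-in for "repeat sequences" are all assumptions made here.

```python
from __future__ import annotations


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int = 20, step: int = 10) -> list[float]:
    """GC content per sliding window: one possible 'local' sequence feature.

    The window and step defaults are arbitrary illustrative values.
    """
    if len(seq) < window:
        return [gc_content(seq)]
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (a crude repeat measure)."""
    longest = 0
    run = 0
    prev = ""
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

A call such as `windowed_gc(sequence, window=50, step=25)` yields a per-window profile that can expose GC-rich or GC-poor stretches hidden by a single whole-sequence average, which is the general idea behind local features.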
