Yang Weiqin, Li Dexin, Huang Ranran
Institute of Marine Science and Technology, Shandong University, Qingdao, China.
School of Computer Science and Technology, Shandong University, Qingdao, China.
Front Microbiol. 2023 Jul 5;14:1215609. doi: 10.3389/fmicb.2023.1215609. eCollection 2023.
In metabolic engineering and synthetic biology applications, promoters with appropriate strengths are critical. However, annotating promoter strength experimentally is time-consuming and laborious. The construction of mutation-based synthetic promoter libraries that span multiple orders of magnitude of promoter strength is receiving increasing attention. A number of machine learning (ML) methods have been applied to synthetic promoter strength prediction, but existing models are limited by how closely synthetic promoters resemble one another.
To enhance the ability of ML models to predict synthetic promoter strength, we propose EVMP (Extended Vision Mutant Priority), a universal framework that utilizes mutation information more effectively. In EVMP, each synthetic promoter is equivalently transformed into a base promoter and the corresponding k-mer mutations, which are input into BaseEncoder and VarEncoder, respectively. EVMP also provides optional data augmentation, which generates multiple copies of the data by selecting different base promoters for the same synthetic promoter.
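The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_mutations` and `augment`, the default `k = 5`, and the restriction to equal-length sequences with substitutions only are all assumptions made for illustration. Each mutated position is reported together with the k-mer window around it in the synthetic promoter, and augmentation is shown as repeating the decomposition against several candidate base promoters.

```python
from typing import List, Tuple

Mutation = Tuple[int, str]  # (0-based position, k-mer window around the mutation)


def kmer_mutations(base: str, synthetic: str, k: int = 5) -> List[Mutation]:
    """Decompose `synthetic` into k-mer mutations relative to `base`.

    Assumption of this sketch: the two sequences have equal length and
    differ only by substitutions. Windows are clipped at sequence ends.
    """
    if len(base) != len(synthetic):
        raise ValueError("this sketch only handles equal-length sequences")
    half = k // 2
    mutations: List[Mutation] = []
    for i, (b, s) in enumerate(zip(base, synthetic)):
        if b != s:
            start = max(0, i - half)
            end = min(len(synthetic), i + half + 1)
            mutations.append((i, synthetic[start:end]))
    return mutations


def augment(synthetic: str, bases: List[str], k: int = 5) -> List[Tuple[str, List[Mutation]]]:
    """Produce one (base promoter, k-mer mutations) pair per candidate base.

    Hypothetical illustration of augmentation: the same synthetic promoter
    yields several training examples, one for each base promoter chosen.
    """
    return [(base, kmer_mutations(base, synthetic, k)) for base in bases]


if __name__ == "__main__":
    synthetic = "ACGTTCGTAC"
    for base, muts in augment(synthetic, ["ACGTACGTAC", "ACGTTCGTAA"]):
        print(base, muts)
```

In this toy setup, the base sequence would feed a base encoder and the list of k-mer windows would feed a separate mutation encoder; how the two representations are combined is left out because the abstract does not specify it.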
In the Trc synthetic promoter library, EVMP was applied to multiple ML models and improved their performance to varying extents, by up to 61.30% (MAE), while the state-of-the-art (SOTA) record was improved by 15.25% (MAE) and 4.03% (R²). Data augmentation based on multiple base promoters further improved model performance, by 17.95% (MAE) and 7.25% (R²), compared with the non-EVMP SOTA record.
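For readers unfamiliar with the reported metrics, the sketch below shows how mean absolute error (MAE) and the coefficient of determination can be computed, and one plausible way to express a percentage improvement. The abstract does not state how its percentages were derived, so the relative-change formula here is an assumption, and the second metric is assumed to be the coefficient of determination. The function names are hypothetical and the numbers are toy values, not results from the paper.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |y_true - y_pred|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae_improvement_percent(baseline_mae: float, new_mae: float) -> float:
    """Relative MAE reduction in percent (assumed definition, lower MAE is better)."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0


if __name__ == "__main__":
    y_true = [1.0, 2.0, 3.0, 4.0]   # toy promoter strengths
    y_pred = [1.5, 2.0, 2.5, 4.0]   # toy predictions
    print(mae(y_true, y_pred), r2(y_true, y_pred))
    print(mae_improvement_percent(0.5, 0.25))
```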
Further study shows that extended vision (or k-mer) is essential for EVMP. We also found that EVMP can alleviate the over-smoothing phenomenon, which may contribute to its effectiveness. Our work suggests that EVMP can highlight the mutation information of synthetic promoters and significantly improve the accuracy of strength prediction. The source code is publicly available on GitHub: https://github.com/Tiny-Snow/EVMP.