Yang Guang, Li Jianing, Hu Jinlu, Shi Jian-Yu
School of Life Sciences, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, China.
School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, China.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae193.
Recognizing cyanobacterial promoters on a genome-wide scale is a vital step. Computational methods are promising aids for this difficult biological identification task. To cope with the lack of real non-promoters, these methods generate artificial non-promoters when building recognition models. However, the resulting artificial difference between promoters and generated non-promoters leads to over-optimistic predictions. Moreover, because existing methods were designed for E. coli or B. subtilis, they cannot uncover novel, distinct motifs among cyanobacterial promoters. To address these issues, this work first proposes a novel non-promoter generation strategy called phantom sampling, which can eliminate the artificial difference between promoters and generated non-promoters. It then develops a novel promoter prediction model based on a Siamese network (SiamProm), which can amplify the hidden difference between promoters and non-promoters by jointly characterizing global associations, upstream and downstream contexts, and neighboring associations with respect to k-mer tokens. Comparison with state-of-the-art methods demonstrates the superiority of phantom sampling and SiamProm. Comprehensive ablation studies and feature-space visualizations further validate the effectiveness of the Siamese network and its components. More importantly, SiamProm, built on phantom sampling, finds a novel cyanobacterial promoter motif ('GCGATCGC') that is palindromic in pattern and conserved in content but shifted in position.
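The k-mer tokens and the palindromic, position-shifted motif mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the SiamProm implementation: the helper names (`kmer_tokens`, `reverse_complement`, `is_rc_palindrome`, `motif_positions`), the choice of k = 3, and the toy sequences are assumptions made purely for illustration, and "palindromic" is read here in the usual DNA sense of a sequence equal to its own reverse complement.

```python
# Illustrative sketch only; none of these helpers come from the paper.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1). k=3 is an assumed value."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_rc_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


def motif_positions(seq, motif):
    """Return every 0-based start index of motif in seq (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


MOTIF = "GCGATCGC"  # motif reported in the abstract

# Made-up toy sequences, not real promoters: same motif content, different offsets.
toy_sequences = ["TTGCGATCGCAA", "AAAAGCGATCGC"]

if __name__ == "__main__":
    print(kmer_tokens(MOTIF))
    print(is_rc_palindrome(MOTIF))
    for s in toy_sequences:
        print(s, motif_positions(s, MOTIF))
```

In the toy sequences the motif text is identical but its start index differs, which is one way to picture "conserved in content but shifted in position".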