Recognition of cyanobacteria promoters via Siamese network-based contrastive learning under novel non-promoter generation.

Authors

Yang Guang, Li Jianing, Hu Jinlu, Shi Jian-Yu

Affiliations

School of Life Sciences, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, China.

School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, China.

Publication

Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae193.

Abstract

Recognizing cyanobacteria promoters on a genome-wide scale is a vital step. Computational methods are a promising aid for this difficult biological identification task. Because real non-promoters are scarce, these methods rely on non-promoter generation when building recognition models. However, the generated non-promoters differ from promoters in artificial, conspicuous ways, which leads to over-optimistic predictions. Moreover, existing methods were designed for E. coli or B. subtilis and cannot uncover novel, distinct motifs among cyanobacterial promoters. To address these issues, this work first proposes a novel non-promoter generation strategy called phantom sampling, which can eliminate the artificial difference between promoters and generated non-promoters. It then develops a novel promoter prediction model based on the Siamese network (SiamProm), which can amplify the hidden difference between promoters and non-promoters through a joint characterization of global associations, upstream and downstream contexts, and neighboring associations with respect to k-mer tokens. Comparison with state-of-the-art methods demonstrates the superiority of our phantom sampling and SiamProm. Comprehensive ablation studies and feature space illustrations also validate the effectiveness of the Siamese network and its components. More importantly, SiamProm, built on our phantom sampling, finds a novel cyanobacterial promoter motif ('GCGATCGC') that is palindrome-patterned and content-conserved but position-shifted.
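
The abstract mentions k-mer tokens and describes the motif 'GCGATCGC' as palindrome-patterned and position-shifted. The sketch below illustrates those two ideas only; it is not the SiamProm implementation. The function names, the choice of k = 3, and the example sequence are assumptions made for illustration, and "palindrome" is read here in the usual DNA sense (a sequence equal to its own reverse complement).

```python
from typing import List

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    k = 3 is an illustrative default, not a value taken from the paper.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def motif_positions(seq: str, motif: str) -> List[int]:
    """Return every 0-based start index at which the motif occurs."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)  # allow overlapping hits
    return positions


MOTIF = "GCGATCGC"               # motif reported in the abstract
example = "TTAGCGATCGCAAT"       # made-up sequence, not real promoter data

print(kmer_tokens("GCGATC", k=3))       # ['GCG', 'CGA', 'GAT', 'ATC']
print(is_palindromic(MOTIF))            # True
print(motif_positions(example, MOTIF))  # [3]
```

Because the motif is described as position-shifted, a scan such as `motif_positions` reports where it occurs in each sequence instead of checking a single fixed offset.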

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/225d/11066903/84996982faae/bbae193f1.jpg
