Finding motifs using DNA images derived from sparse representations.

Affiliations

Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, United States.

Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, United States.

Publication information

Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad378.

DOI: 10.1093/bioinformatics/btad378
PMID: 37294804
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10290554/

Abstract

MOTIVATION

Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatorial or probabilistic approaches, which can be biased by heuristics such as substring masking for multiple-motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, despite the success of these networks in supervised learning tasks, inferring motifs from neural networks remains challenging from both a modeling and a computational standpoint.

RESULTS

We present a principled representation learning approach to motif discovery based on a hierarchical sparse representation. In addition to the short, enriched primary binding sites, our method effectively discovers gapped, long, or overlapping motifs, which we show commonly exist in next-generation sequencing datasets. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept that emerged from our approach, enumerating at the image level, effectively overcomes the k-mers paradigm: modest computational resources suffice to capture long and varied but conserved patterns as well as the primary binding sites.
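
To make the contrast between k-mer enumeration and an image-level sparse code more concrete, here is a minimal Python sketch. It is not taken from the paper or from MOTIFs.jl. It assumes that a DNA string is one-hot encoded as a 4 × L binary matrix (the "image") and that a sparse code is a short list of (filter, position) pairs. The helper names `one_hot`, `place_filters`, and `kmer_space`, the example sequence, and the 8-bp pattern are all hypothetical.

```python
# Toy illustration only: NOT the MOTIFs.jl algorithm.
ALPHABET = "ACGT"

def one_hot(seq):
    """Return a 4 x len(seq) 0/1 matrix (rows A, C, G, T): the assumed 'DNA image'."""
    return [[1 if base == letter else 0 for base in seq] for letter in ALPHABET]

def place_filters(length, filters, code):
    """Rebuild an image from a sparse code.

    code is a list of (filter_index, position) pairs; each pair adds the
    chosen filter (itself a small 4 x m image) at the given column offset.
    """
    image = [[0] * length for _ in range(4)]
    for k, pos in code:
        for row in range(4):
            for j, value in enumerate(filters[k][row]):
                image[row][pos + j] += value
    return image

def kmer_space(k):
    """Number of distinct DNA strings of length k (grows as 4**k)."""
    return 4 ** k

seq = "TTGACGTCAAATGACGTCA"           # hypothetical 19-bp sequence
filters = [one_hot("TGACGTCA")]       # hypothetical 8-bp pattern used as one filter
code = [(0, 1), (0, 11)]              # sparse code: only two nonzero coefficients

full = one_hot(seq)
partial = place_filters(len(seq), filters, code)

# Count how many of the sequence's bases the two placements explain.
covered = sum(partial[r][c] for r in range(4) for c in range(len(seq)))
print(f"{covered} of {len(seq)} bases explained by {len(code)} coefficients")
print(f"distinct 8-mers: {kmer_space(8)}, distinct 20-mers: {kmer_space(20)}")
```

In this toy setting, two coefficients account for most of the sequence, whereas the number of candidate k-mers grows as 4^k, which is one way to read the abstract's remark about long patterns and modest computational resources.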

AVAILABILITY AND IMPLEMENTATION

Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d00/10290554/7eed3e11b02b/btad378f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d00/10290554/536c832ea214/btad378f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d00/10290554/971b73784d31/btad378f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d00/10290554/9d1164d18cb4/btad378f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d00/10290554/91fee6b4f8b2/btad378f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d00/10290554/5c74c8e6262b/btad378f6.jpg
