Finding motifs using DNA images derived from sparse representations.

Affiliations

Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, United States.

Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, United States.

Publication information

Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad378.

DOI:10.1093/bioinformatics/btad378

PMID:37294804

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10290554/

Abstract

MOTIVATION

Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks.

RESULTS

We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs, which we show commonly exist in next-generation sequencing datasets, in addition to the short and enriched primary binding sites. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept that emerged from our approach, enumerating at the image level, effectively overcomes the k-mers paradigm, making it possible to capture long and varied but conserved patterns, as well as the primary binding sites, with modest computational resources.
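To make the scaling concern behind the k-mers paradigm concrete, here is a minimal Python sketch. It is not the authors' method and does not use the MOTIFs.jl API. The function names `kmer_space_size` and `one_hot` are hypothetical, chosen only for illustration. The one-hot matrix shown is the common numeric encoding of a DNA string, assumed here as a typical model input; it is not the sparse-representation "DNA image" described in the abstract.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide order for the rows of the encoding


def kmer_space_size(k: int) -> int:
    """Number of distinct DNA k-mers of length k, i.e. 4**k."""
    return len(ALPHABET) ** k


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a 4 x len(seq) binary matrix (rows: A, C, G, T)."""
    index = {base: row for row, base in enumerate(ALPHABET)}
    image = np.zeros((len(ALPHABET), len(seq)), dtype=np.uint8)
    for col, base in enumerate(seq.upper()):
        if base in index:  # ambiguous bases such as N are left as all-zero columns
            image[index[base], col] = 1
    return image


if __name__ == "__main__":
    # The candidate space grows exponentially with motif length.
    for k in (6, 10, 20):
        print(k, kmer_space_size(k))
    print(one_hot("ACGTN"))
```

The count of candidate k-mers grows as 4^k, so exhaustively enumerating substrings becomes impractical for long motifs, whereas a fixed-size numeric encoding of each sequence grows only linearly with sequence length.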

AVAILABILITY AND IMPLEMENTATION

Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033.
