Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, United States.
Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, United States.
Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad378.
Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatorial or probabilistic approaches, which can be biased by heuristics such as substring masking when discovering multiple motifs. In recent years, deep neural networks have become increasingly popular for motif discovery because they can capture complex patterns in data. Nonetheless, despite the success of these networks in supervised learning tasks, inferring motifs from them remains challenging from both a modeling and a computational standpoint.
We present a principled representation learning approach to motif discovery based on a hierarchical sparse representation. In addition to the short, enriched primary binding sites, our method effectively discovers gapped, long, or overlapping motifs, which we show to be common in next-generation sequencing datasets. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept that emerged from our approach, enumerating at the image level, effectively overcomes the k-mers paradigm: it allows long and varied but conserved patterns, as well as the primary binding sites, to be captured with modest computational resources.
Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033.