LIRMM, Univ Montpellier, CNRS, Montpellier, France.
Institut Français de Bioinformatique, CNRS UAR 3601, Évry, France.
Bioinformatics. 2023 Apr 3;39(4). doi: 10.1093/bioinformatics/btad141.
Seeking probabilistic motifs in a sequence is a common task when annotating putative transcription factor binding sites or other RNA/DNA binding sites. Useful motif representations include position weight matrices (PWMs), dinucleotide PWMs (di-PWMs), and hidden Markov models (HMMs). Dinucleotide PWMs not only retain the simplicity of PWMs (a matrix form and a cumulative scoring function) but also incorporate dependency between adjacent positions in the motif, which PWMs disregard. For instance, the HOCOMOCO database provides di-PWM motifs derived from experimental data to represent binding sites. Currently, two programs, SPRy-SARUS and MOODS, can search for occurrences of di-PWMs in sequences.
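To make the cumulative scoring function concrete, here is a minimal Python sketch. It assumes a di-PWM for a motif of length m is stored as m-1 columns, each mapping the 16 dinucleotides to a score, and that a word's score is the sum over its overlapping dinucleotides. The names `make_column`, `dipwm_score`, `scan`, and the toy matrix are hypothetical illustrations, not part of dipwmsearch, SPRy-SARUS, or MOODS.

```python
from itertools import product

ALPHABET = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]


def make_column(**scores):
    """Build one di-PWM column: every dinucleotide scores 0.0 unless overridden."""
    column = {dinuc: 0.0 for dinuc in DINUCLEOTIDES}
    column.update(scores)
    return column


def dipwm_score(matrix, word):
    """Sum the scores of the overlapping dinucleotides of `word`.

    `matrix` is a list of columns; column i scores the dinucleotide word[i:i+2],
    so a matrix with k columns scores words of length k + 1.
    """
    if len(word) != len(matrix) + 1:
        raise ValueError("word length must be number of columns + 1")
    return sum(column[word[i:i + 2]] for i, column in enumerate(matrix))


def scan(matrix, sequence, threshold):
    """Naive sliding-window search: score every window, keep those >= threshold."""
    width = len(matrix) + 1
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = dipwm_score(matrix, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy di-PWM for a motif of length 3 (two columns); values are made up.
TOY_DIPWM = [make_column(AC=2.0), make_column(CG=3.0, CT=1.0)]

print(scan(TOY_DIPWM, "TTACGACT", threshold=3.0))
```

Because consecutive dinucleotides share a letter, the score of position i depends on the letter chosen at position i-1, which is the adjacent-position dependency that a plain PWM cannot express.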
We propose a Python package called dipwmsearch, which provides an original and efficient algorithm for this task: it first enumerates the words that match the di-PWM, and then searches for all of them at once in the sequence, even if the sequence contains IUPAC codes. The user benefits from easy installation via PyPI or conda, comprehensive documentation, and executable scripts that facilitate the use of di-PWMs.
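The enumerate-then-search idea can be sketched as follows. This is an illustration under stated assumptions, not the dipwmsearch implementation: the function names are hypothetical, the enumeration uses a simple branch-and-bound with a loose upper bound on the remaining score, the simultaneous search is replaced by a set lookup over fixed-length windows (in place of a dedicated multi-pattern matcher), and IUPAC codes in the sequence are not handled.

```python
from itertools import product

ALPHABET = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]


def make_column(**scores):
    """One di-PWM column: every dinucleotide scores 0.0 unless overridden."""
    column = {dinuc: 0.0 for dinuc in DINUCLEOTIDES}
    column.update(scores)
    return column


def enumerate_matching_words(matrix, threshold):
    """List every word whose di-PWM score is >= threshold (branch and bound)."""
    n_cols = len(matrix)
    # best_suffix[i]: upper bound on the score obtainable from columns i..end.
    best_suffix = [0.0] * (n_cols + 1)
    for i in range(n_cols - 1, -1, -1):
        best_suffix[i] = best_suffix[i + 1] + max(matrix[i].values())

    words = []

    def extend(prefix, score):
        col = len(prefix) - 1  # index of the next column to consume
        if col == n_cols:
            words.append(prefix)  # the bound check below guarantees score >= threshold
            return
        for base in ALPHABET:
            new_score = score + matrix[col][prefix[-1] + base]
            # Prune branches that cannot reach the threshold even in the best case.
            if new_score + best_suffix[col + 1] >= threshold:
                extend(prefix + base, new_score)

    if best_suffix[0] >= threshold:
        for base in ALPHABET:
            extend(base, 0.0)
    return words


def search_words(words, sequence):
    """Report (position, word) for every window of `sequence` found in `words`."""
    if not words:
        return []
    word_set = set(words)
    width = len(words[0])  # all enumerated words share the motif length
    return [
        (start, sequence[start:start + width])
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] in word_set
    ]


# Toy di-PWM for a motif of length 3 (two columns); values are made up.
TOY_DIPWM = [make_column(AC=2.0), make_column(CG=3.0, CT=1.0)]
matching = enumerate_matching_words(TOY_DIPWM, threshold=3.0)
print(matching)
print(search_words(matching, "TTACGACT"))
```

The enumeration cost depends on the matrix and the threshold rather than on the sequence length, so this strategy pays off when the threshold is selective and the set of matching words stays small.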
dipwmsearch is available at https://pypi.org/project/dipwmsearch/ and https://gite.lirmm.fr/rivals/dipwmsearch/ under the CeCILL license.