Fast and exact quantification of motif occurrences in biological sequences.

Affiliations

Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA.

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA.

Publication information

BMC Bioinformatics. 2021 Sep 18;22(1):445. doi: 10.1186/s12859-021-04355-6.

DOI:10.1186/s12859-021-04355-6
PMID:34537012
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8449872/
Abstract

BACKGROUND

Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to use on large motif sets. Approximate formulae, e.g. based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and a potential substitute for currently employed heuristics.
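To make the cost of an exact count distribution concrete, here is a minimal Python sketch of a textbook approach: a dynamic program over a pattern-matching automaton. It is not the motif_prob algorithm, whose formula is not given in this abstract. It assumes an order-0 (independent bases) background rather than a general Markov model, counts overlapping occurrences on one strand only, and uses the hypothetical names `kmp_automaton` and `count_distribution`. The work grows roughly as n × m × K × 4 for sequence length n, motif length m, and K tracked counts, which suggests why exact methods become costly for large motif sets on long genomes.

```python
def kmp_automaton(motif, alphabet="ACGT"):
    """Return DFA transitions; state = length of the motif prefix currently matched."""
    m = len(motif)
    # fail[j] = length of the longest proper border of motif[:j]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and motif[i] != motif[k]:
            k = fail[k]
        if motif[i] == motif[k]:
            k += 1
        fail[i + 1] = k
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            if state < m and motif[state] == c:
                delta[state][c] = state + 1
            elif state == 0:
                delta[state][c] = 0
            else:
                # fall back to the border state, which is already filled in
                delta[state][c] = delta[fail[state]][c]
    return delta


def count_distribution(motif, n, probs, max_count):
    """P(motif occurs exactly k times) for k < max_count in a random
    sequence of length n; the last entry pools P(count >= max_count).

    probs maps each base to its probability (assumed i.i.d. background).
    """
    m = len(motif)
    delta = kmp_automaton(motif, alphabet="".join(probs))
    # dist[state][k] = probability of being in `state` with k occurrences so far
    dist = [[0.0] * (max_count + 1) for _ in range(m + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (max_count + 1) for _ in range(m + 1)]
        for s in range(m + 1):
            for k in range(max_count + 1):
                p = dist[s][k]
                if p == 0.0:
                    continue
                for c, pc in probs.items():
                    t = delta[s][c]
                    # reaching the final state means one more occurrence
                    k2 = min(k + 1, max_count) if t == m else k
                    new[t][k2] += p * pc
        dist = new
    return [sum(dist[s][k] for s in range(m + 1)) for k in range(max_count + 1)]
```

For example, `count_distribution("AA", 3, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, 2)` gives the probabilities of zero, one, and two or more occurrences of "AA" in a random 3-base sequence. An upper-tail p-value for an observed count follows by summing the corresponding entries.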

RESULTS

We implement motif_prob in both Perl and C++, using an efficient error-bounded iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run-time benchmarks, along with a real-world use case on bacterial motif characterization. Our software can process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using its fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then show how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob .
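The role of GC content can be illustrated with a rough Python sketch. It is not the motif_prob method and does not reproduce the paper's results. It assumes independent bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, counts one strand only, and uses a plain Poisson upper tail as a crude stand-in for an exact p-value. The function names (`site_probability`, `expected_count`, `poisson_upper_tail`) are hypothetical.

```python
import math


def site_probability(motif, gc):
    """Probability that the motif starts at a given position (i.i.d. bases)."""
    probs = {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "G": gc / 2, "C": gc / 2}
    p = 1.0
    for base in motif:
        p *= probs[base]
    return p


def expected_count(motif, genome_len, gc):
    """Expected number of (possibly overlapping) occurrences on one strand."""
    positions = genome_len - len(motif) + 1
    return positions * site_probability(motif, gc)


def poisson_upper_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam); an approximation, not an exact p-value."""
    if observed <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= lam / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Same GC-rich 13-mer, same genome length, two different GC contents.
motif = "GCGCGCGCGCGCG"
for gc in (0.3, 0.7):
    lam = expected_count(motif, 5_000_000, gc)
    print(f"GC={gc}: expected count {lam:.3g}, P(X >= 3) = {poisson_upper_tail(3, lam):.3g}")
```

Under these assumptions, the same observed count of a GC-rich motif is far more surprising in a low-GC genome than in a high-GC one. That is why raw counts cannot be compared across bacteria without a background-aware p-value.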

CONCLUSIONS

The motif_prob software is a multi-platform, efficient, open-source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery and characterization tools to quantify enrichment and deviation from expected frequency ranges with exact p-values, without loss of data processing efficiency.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b1e/8449872/6f6e54d6774e/12859_2021_4355_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b1e/8449872/a3b8f2ddaa54/12859_2021_4355_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b1e/8449872/4bd09daeda25/12859_2021_4355_Fig3_HTML.jpg
