Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA.
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA.
BMC Bioinformatics. 2021 Sep 18;22(1):445. doi: 10.1186/s12859-021-04355-6.
Identifying motifs and quantifying their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximate formulae, e.g. based on the compound Poisson distribution, are faster, but reliable p value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and a potential substitute for currently employed heuristics.
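To make the idea of an exact occurrence probability concrete, the following minimal sketch computes the probability that a motif appears at least once in a random sequence. It is not the formula implemented in motif_prob. It assumes a zero-order (independent bases) background, a single strand, and a standard prefix-matching automaton with dynamic programming; the function names `build_automaton` and `prob_at_least_one` are hypothetical. Its cost grows with sequence length times motif length, which illustrates why naive exact calculations become expensive on large motif sets.

```python
def build_automaton(motif, alphabet="ACGT"):
    """Transition table for states 0..len(motif)-1 (length of matched prefix).

    A transition that reaches len(motif) means the motif has just occurred.
    """
    m = len(motif)
    # fail[k] = length of the longest proper border of motif[:k]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and motif[i] != motif[k]:
            k = fail[k]
        if motif[i] == motif[k]:
            k += 1
        fail[i + 1] = k
    delta = []
    for state in range(m):
        row = {}
        for c in alphabet:
            s = state
            while s > 0 and motif[s] != c:
                s = fail[s]
            if motif[s] == c:
                s += 1
            row[c] = s
        delta.append(row)
    return delta


def prob_at_least_one(motif, n, base_probs):
    """P(motif occurs at least once in n i.i.d. bases), single strand.

    Assumption: bases are independent with probabilities base_probs.
    """
    m = len(motif)
    delta = build_automaton(motif, alphabet="".join(base_probs))
    # dist[s] = probability of being in state s without any occurrence so far
    dist = [0.0] * m
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * m
        for s, p in enumerate(dist):
            if p == 0.0:
                continue
            for c, pc in base_probs.items():
                t = delta[s][c]
                if t < m:  # mass reaching state m (an occurrence) is dropped
                    new[t] += p * pc
        dist = new
    return 1.0 - sum(dist)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(prob_at_least_one("AA", 3, uniform))
```

For example, with uniform base probabilities the motif `AA` occurs in 7 of the 64 possible sequences of length 3, which the sketch reproduces.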
We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time, along with a real-world use case on bacterial motif characterization. Our software can process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times of both the Perl and C++ code are several orders of magnitude smaller (50-1000× faster) than those of MoSDi, even when MoSDi uses its fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then show, using motifs found in antimicrobial resistance genes, how p value quantification is crucial for enrichment quantification when bacteria have different GC content. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.
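The dependence on GC content can be illustrated with a small sketch. It is not motif_prob's method: it assumes a zero-order background in which G and C each have probability GC/2 and A and T each have probability (1 - GC)/2, computes the expected count of a motif on a single strand, and uses a plain Poisson tail as a crude stand-in for a p value. The function names are hypothetical, and the Poisson tail ignores motif self-overlap, which is one reason exact formulae are preferred.

```python
import math


def base_probs_from_gc(gc):
    """Zero-order base probabilities for a given GC fraction (assumption)."""
    at = (1.0 - gc) / 2.0
    g = gc / 2.0
    return {"A": at, "C": g, "G": g, "T": at}


def expected_count(motif, genome_length, gc):
    """Expected number of motif start positions on one strand."""
    probs = base_probs_from_gc(gc)
    p_motif = math.prod(probs[b] for b in motif)
    return (genome_length - len(motif) + 1) * p_motif


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); a crude approximation, not exact."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    motif = "GCGCGC"
    for gc in (0.35, 0.50, 0.65):
        lam = expected_count(motif, 5_000_000, gc)
        print(gc, lam, poisson_upper_tail(2000, lam))
```

The same observed count of a GC-rich motif can therefore be unremarkable in a high-GC genome and strongly enriched in a low-GC one, which is why raw counts alone are not comparable across bacteria.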
The motif_prob software is a multi-platform, efficient, open-source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery and characterization tools to quantify enrichment and deviation from expected frequency ranges with exact p values, without loss of data processing efficiency.