MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

Authors

Shang Jingbo, Peng Jian, Han Jiawei

Affiliation

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Publication

Proc SIAM Int Conf Data Min. 2016 May;2016:558-566. doi: 10.1137/1.9781611974348.63.

DOI:10.1137/1.9781611974348.63

PMID:28174677

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5292242/

Abstract

Consecutive pattern mining, which aims to find sequential patterns that are substrings, is a special case of frequent pattern mining and has played a crucial role in many real-world applications, especially in biological sequence analysis, time series analysis, and network log mining. Approximations between strings, including insertions, deletions, and substitutions, are widely used in biological sequence comparisons. However, most existing string pattern mining methods consider only Hamming distance, which does not allow insertions/deletions (indels). Little attention has been paid to general approximate consecutive frequent pattern mining under edit distance, potentially because of its high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem, which identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity that checks whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem using several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study on a cancer genomics application demonstrate the effectiveness and efficiency of our algorithm compared to several existing methods.
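To make the contrast between Hamming distance and edit distance concrete, the following is a minimal sketch of approximate occurrence finding. It is not the linear-time decision algorithm proposed in the paper; it uses the classic dynamic-programming approach to approximate string matching (often attributed to Sellers), which runs in time proportional to pattern length times text length. The function names `edit_match_ends` and `hamming_match_ends`, and the choice to report 1-based end positions, are assumptions made for illustration only. How the paper turns occurrences into a support count is not specified in the abstract, so no support definition is attempted here.

```python
from typing import List


def edit_match_ends(pattern: str, text: str, k: int) -> List[int]:
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Illustrative baseline only (classic DP), not the paper's algorithm.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]   # col[i-1] from the previous text position
        col[0] = 0           # an occurrence may start anywhere in text
        for i in range(1, m + 1):
            old = col[i]     # col[i] from the previous text position
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # pattern character missing
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def hamming_match_ends(pattern: str, text: str, k: int) -> List[int]:
    """Return 1-based end positions of length-m windows of `text` that
    differ from `pattern` in at most k positions (substitutions only)."""
    m = len(pattern)
    return [s + m
            for s in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(pattern, text[s:s + m])) <= k]


if __name__ == "__main__":
    # "ACT" is "ACGT" with one deletion: found under edit distance 1,
    # but no length-4 window is within Hamming distance 1.
    print(edit_match_ends("ACGT", "GGACTGG", 1))     # [5]
    print(hamming_match_ends("ACGT", "GGACTGG", 1))  # []
```

In the example, the text contains `ACT`, which is `ACGT` with one base deleted. The edit-distance search reports it, while the Hamming-distance search finds nothing, which is the gap the abstract attributes to methods that ignore indels.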

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ca9/5292242/4aa4d1b1ea61/nihms844263f1.jpg
