Suppr超能文献

MACFP:编辑距离下的最大近似连续频繁模式挖掘

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance.

作者信息

Shang Jingbo, Peng Jian, Han Jiawei

机构信息

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

出版信息

Proc SIAM Int Conf Data Min. 2016 May;2016:558-566. doi: 10.1137/1.9781611974348.63.

Abstract

Consecutive pattern mining aiming at finding sequential patterns substrings, is a special case of frequent pattern mining and has been played a crucial role in many real world applications, especially in biological sequence analysis, time series analysis, and network log mining. Approximations, including insertions, deletions, and substitutions, between strings are widely used in biological sequence comparisons. However, most existing string pattern mining methods only consider hamming distance without insertions/deletions (indels). Little attention has been paid to the general approximate consecutive frequent pattern mining under edit distance, potentially due to the high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem that identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity to check whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem with several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study on cancer genomics application demonstrate the effectiveness and efficiency of our algorithm, compared to several existing methods.

摘要

旨在寻找连续模式子串的连续模式挖掘是频繁模式挖掘的一种特殊情况,并且在许多实际应用中发挥了关键作用,尤其是在生物序列分析、时间序列分析和网络日志挖掘中。字符串之间的近似,包括插入、删除和替换,在生物序列比较中被广泛使用。然而,大多数现有的字符串模式挖掘方法只考虑汉明距离,而不考虑插入/删除(indels)。由于计算复杂度高,特别是对于具有数十亿碱基对的DNA序列,编辑距离下的一般近似连续频繁模式挖掘很少受到关注。在本文中,我们针对这个问题引入了一种有效的解决方案。我们首先提出了最大近似连续频繁模式挖掘(MACFP)问题,该问题在长查询序列中识别编辑距离下的子串模式。然后,我们提出了一种具有线性时间复杂度的新颖算法,用于检查查询序列中子串模式的支持度是否高于预定义阈值,从而大大降低了MACFP的计算复杂度。借助这种快速决策算法,我们可以通过几种索引和搜索技术有效地解决原始模式发现问题。与几种现有方法相比,在序列模式分析方面的综合实验以及对癌症基因组学应用的研究证明了我们算法的有效性和效率。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ca9/5292242/4aa4d1b1ea61/nihms844263f1.jpg

相似文献

2
An algorithm for approximate tandem repeats.一种用于近似串联重复序列的算法。
J Comput Biol. 2001;8(1):1-18. doi: 10.1089/106652701300099038.
5
Approximate Graph Edit Distance in Quadratic Time.二次时间内的近似图编辑距离。
IEEE/ACM Trans Comput Biol Bioinform. 2020 Mar-Apr;17(2):483-494. doi: 10.1109/TCBB.2015.2478463. Epub 2015 Sep 14.
10
Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU.
IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):386-395. doi: 10.1109/TCBB.2019.2935061. Epub 2021 Feb 4.

本文引用的文献

3
CONTRA: copy number analysis for targeted resequencing.对照:靶向重测序的拷贝数分析。
Bioinformatics. 2012 May 15;28(10):1307-13. doi: 10.1093/bioinformatics/bts146. Epub 2012 Apr 2.
5
Fast and accurate short read alignment with Burrows-Wheeler transform.使用Burrows-Wheeler变换进行快速准确的短读比对。
Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18.
6
Tandem repeats over the edit distance.编辑距离上的串联重复序列。
Bioinformatics. 2007 Jan 15;23(2):e30-5. doi: 10.1093/bioinformatics/btl309.

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验