MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure.

Affiliations

College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, P.R. China, 066004.

The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao City, P.R. China, 066004.

Publication information

PLoS One. 2018 Apr 23;13(4):e0195601. doi: 10.1371/journal.pone.0195601. eCollection 2018.

DOI:10.1371/journal.pone.0195601

PMID:29684052

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5912758/

Abstract

Extracting meaningful patterns from biological sequence data is an important approach to discovering the regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents MpBsmi, a new algorithm for biological sequence pattern mining based on a data index structure. MpBsmi employs a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining, and pattern expansion. The ST and DB-Index of single items are first obtained by scanning the sequence database once. A new fast support counting algorithm then mines the ST to identify the frequent single items. Based on a connection strategy, the frequent patterns are expanded, and the expanded ST is updated by scanning the DB-Index. The fast support counting algorithm is then used to obtain the frequent expanded patterns. Finally, a new pruning technique is applied to the extended patterns to avoid generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm, including its accuracy, stability, and scalability. The results show that the proposed algorithm achieves better space efficiency, stability, and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
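The abstract does not specify the layout of the ST or the DB-Index, so the following is only a minimal sketch of the general idea it describes: build a position table in one scan, count support from that table, and grow contiguous patterns one item at a time. All function names are hypothetical, support is assumed to mean the number of sequences containing a pattern, the raw sequence list stands in for the DB-Index, and the pruning step is a plain frequent-item check rather than the paper's pruning technique.

```python
from collections import defaultdict


def build_position_table(sequences):
    """Scan the database once and record (sequence id, position) for every item.

    This plays the role of a position table for single items (an assumption;
    the paper's ST may be organised differently).
    """
    table = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            table[item].append((sid, pos))
    return table


def support(occurrences):
    """Assumed support measure: number of distinct sequences with an occurrence."""
    return len({sid for sid, _ in occurrences})


def mine_contiguous_patterns(sequences, min_support):
    """Return {pattern: support} for all frequent contiguous patterns."""
    table = build_position_table(sequences)
    frequent = {p: occ for p, occ in table.items() if support(occ) >= min_support}
    frequent_items = set(frequent)
    result = {}

    while frequent:
        for pattern, occ in frequent.items():
            result[pattern] = support(occ)

        next_level = {}
        for pattern, occ in frequent.items():
            extended = defaultdict(list)
            for sid, start in occ:
                nxt = start + len(pattern)  # position right after this occurrence
                if nxt < len(sequences[sid]):
                    item = sequences[sid][nxt]
                    # Simple pruning: an infrequent item cannot yield a frequent extension.
                    if item in frequent_items:
                        extended[pattern + item].append((sid, start))
            for candidate, cand_occ in extended.items():
                if support(cand_occ) >= min_support:
                    next_level[candidate] = cand_occ
        frequent = next_level

    return result


patterns = mine_contiguous_patterns(["ACDAC", "ACD", "DAC"], min_support=2)
print(patterns)
```

Because each candidate is generated only from positions where its prefix already occurs, the database is never rescanned in full; only the stored occurrence lists are extended.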

Similar articles

Mining frequent biological sequences based on bitmap without candidate sequence generation.

Comput Biol Med. 2016 Feb 1;69:152-7. doi: 10.1016/j.compbiomed.2015.12.016. Epub 2015 Dec 30.

Frequent patterns mining in multiple biological sequences.

Comput Biol Med. 2013 Oct;43(10):1444-52. doi: 10.1016/j.compbiomed.2013.07.009. Epub 2013 Jul 27.

Mining Contiguous Sequential Generators in Biological Sequences.

IEEE/ACM Trans Comput Biol Bioinform. 2016 Sep-Oct;13(5):855-867. doi: 10.1109/TCBB.2015.2495132. Epub 2015 Oct 26.

PMBC: pattern mining from biological sequences with wildcard constraints.

Comput Biol Med. 2013 Jun;43(5):481-92. doi: 10.1016/j.compbiomed.2013.02.006. Epub 2013 Mar 16.

Mining of high utility-probability sequential patterns from uncertain databases.

PLoS One. 2017 Jul 25;12(7):e0180931. doi: 10.1371/journal.pone.0180931. eCollection 2017.

Discovering metric temporal constraint networks on temporal databases.

Artif Intell Med. 2013 Jul;58(3):139-54. doi: 10.1016/j.artmed.2013.03.006. Epub 2013 May 6.

Expectation Maximization of Frequent Patterns, a Specific, Local, Pattern-Based Biclustering Algorithm for Biological Datasets.

IEEE/ACM Trans Comput Biol Bioinform. 2016 Sep-Oct;13(5):812-824. doi: 10.1109/TCBB.2015.2510011. Epub 2015 Dec 17.

Improved multiple sequence alignments using coupled pattern mining.

IEEE/ACM Trans Comput Biol Bioinform. 2013 Sep-Oct;10(5):1098-112. doi: 10.1109/TCBB.2013.36.

A node linkage approach for sequential pattern mining.

PLoS One. 2014 Jun 16;9(6):e95418. doi: 10.1371/journal.pone.0095418. eCollection 2014.
