MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure.

Affiliations

College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei P.R.China, 066004.

The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao City, P.R.China, 066004.

Publication information

PLoS One. 2018 Apr 23;13(4):e0195601. doi: 10.1371/journal.pone.0195601. eCollection 2018.

Abstract

Extracting meaningful patterns from biological sequence data is an important approach to discovering the regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents MpBsmi, a new algorithm for biological sequence pattern mining based on a data index structure. MpBsmi uses a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining, and pattern expansion. The ST and DB-Index of single items are first obtained in a single scan of the sequence database. A new fast support counting algorithm then mines the ST to identify the frequent single items. Using a connection strategy, the frequent patterns are expanded, and the expanded ST is updated by scanning the DB-Index. The fast support counting algorithm is applied again to obtain the frequent expanded patterns. Finally, a new pruning technique for extended patterns avoids generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm, including its accuracy, stability, and scalability. The results show that the proposed algorithm achieves better space efficiency, stability, and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
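The abstract does not give the exact layout of ST or DB-Index, so the following is only a minimal sketch of the general idea it describes: build a position table for single items in one scan, count support from that table, extend frequent patterns by one item, and prune extensions that cannot be frequent. The function names (`build_position_table`, `support`, `mine_contiguous_patterns`), the `(sequence id, offset)` table layout, and the use of the raw sequence list in place of DB-Index are all assumptions made for illustration, not the paper's actual data structures.

```python
from collections import defaultdict


def build_position_table(sequences):
    """One scan of the database: map each single item to the
    (sequence id, start offset) pairs where it occurs.
    Hypothetical stand-in for the paper's ST; the real layout may differ."""
    table = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            table[item].append((sid, pos))
    return table


def support(occurrences):
    """Support = number of distinct sequences that contain the pattern."""
    return len({sid for sid, _ in occurrences})


def mine_contiguous_patterns(sequences, min_support):
    """Return {pattern: support} for every frequent contiguous pattern.
    Assumption: patterns are contiguous substrings and are grown one
    item to the right per level."""
    table = build_position_table(sequences)
    # Level 1: keep only the frequent single items.
    current = {p: occ for p, occ in table.items()
               if support(occ) >= min_support}
    frequent_items = set(current)
    result = {}
    while current:
        next_level = {}
        for pattern, occ in current.items():
            result[pattern] = support(occ)
            extended = defaultdict(list)
            for sid, pos in occ:
                nxt = pos + len(pattern)
                if nxt >= len(sequences[sid]):
                    continue  # pattern ends at the end of this sequence
                item = sequences[sid][nxt]
                # Pruning: an extension by an infrequent item
                # can never be frequent, so skip it early.
                if item in frequent_items:
                    extended[pattern + item].append((sid, pos))
            for new_pattern, new_occ in extended.items():
                if support(new_occ) >= min_support:
                    next_level[new_pattern] = new_occ
        current = next_level
    return result


# Toy input (made-up strings, not Pfam data).
toy = ["ACDEF", "ACDGF", "KACDE"]
patterns = mine_contiguous_patterns(toy, min_support=3)
```

Because each extension reuses the occurrence list of its parent pattern, the database is scanned in full only once; later levels touch only the positions immediately after known occurrences. Whether this matches the paper's connection strategy and pruning rule in detail cannot be determined from the abstract alone.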

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/493b/5912758/11f82c0d0b4a/pone.0195601.g001.jpg
