College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, P.R. China, 066004.
The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao City, P.R. China, 066004.
PLoS One. 2018 Apr 23;13(4):e0195601. doi: 10.1371/journal.pone.0195601. eCollection 2018.
Extracting meaningful patterns from biological sequence data is an important approach to discovering the biological regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents a new algorithm for biological sequence pattern mining, abbreviated MpBsmi, that is based on a data index structure. MpBsmi uses a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining, and pattern expansion. The ST and DB-Index of single items are first obtained by scanning the sequence database once. A new fast support counting algorithm then mines the ST to identify the frequent single items. Using a connection strategy, the frequent patterns are expanded, and the ST of each expanded pattern is updated by scanning the DB-Index. The fast support counting algorithm is applied again to obtain the frequent expanded patterns. Finally, a new pruning technique for expanded patterns avoids generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm, including its accuracy, stability, and scalability. The results show that the proposed algorithm achieves better space efficiency, stability, and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
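The workflow described above (one scan to build a position table, support counting, pattern expansion, and pruning) can be illustrated with a minimal Python sketch. This is not the MpBsmi implementation: the abstract does not specify the layout of ST or DB-Index, the connection strategy, or the pruning rule. The sketch assumes contiguous patterns, expansion by one item to the right, and pruning that restricts expansion to frequent single items. All function names are hypothetical, and the raw sequence list stands in for DB-Index.

```python
from collections import defaultdict


def build_position_table(sequences):
    """Scan the database once; map each item to {sequence id: [positions]}."""
    table = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            table[item][sid].append(pos)
    return table


def support(entry):
    """Support = number of sequences in which the pattern occurs."""
    return len(entry)


def extend(entry, sequences, item):
    """Expand a pattern by one item (assumed: contiguous, to the right).

    `entry` maps sequence id -> end positions of the current pattern.
    Only the recorded positions are checked, not whole sequences.
    """
    new_entry = {}
    for sid, positions in entry.items():
        seq = sequences[sid]
        hits = [p + 1 for p in positions
                if p + 1 < len(seq) and seq[p + 1] == item]
        if hits:
            new_entry[sid] = hits
    return new_entry


def mine(sequences, min_support):
    """Return {pattern: support} for all frequent contiguous patterns."""
    table = build_position_table(sequences)
    frequent_items = sorted(
        item for item, entry in table.items() if support(entry) >= min_support
    )
    result = {}
    frontier = [(item, dict(table[item])) for item in frequent_items]
    while frontier:
        pattern, entry = frontier.pop()
        result[pattern] = support(entry)
        # Assumed pruning: expand only with frequent single items, and
        # keep only expansions that still meet the support threshold.
        for item in frequent_items:
            new_entry = extend(entry, sequences, item)
            if support(new_entry) >= min_support:
                frontier.append((pattern + item, new_entry))
    return result


if __name__ == "__main__":
    # Toy sequences, not real Pfam data.
    print(mine(["MKVLA", "MKVA", "KVLM"], min_support=2))
```

Because each expansion inspects only the stored end positions of its parent pattern, the database is not rescanned for every candidate, which is the general idea behind position-table approaches to support counting.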