An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences.

Authors

Ye Kai, Kosters Walter A, Ijzerman Adriaan P

Affiliations

Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research and Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands.

Publication information

Bioinformatics. 2007 Mar 15;23(6):687-93. doi: 10.1093/bioinformatics/btl665. Epub 2007 Jan 19.

Abstract

MOTIVATION

Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), identify frequent patterns directly from unaligned biological sequences without attempting to align them. Here we propose a new algorithm that is more efficient and offers more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets.

RESULTS

In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets.
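To make the idea of pattern growth concrete, here is a minimal sketch in Python. It is not one of the six algorithms from the paper: the abstract does not define the type I, II and III patterns, so the sketch assumes the simplest possible case, contiguous residue strings, and counts support as the number of sequences containing a pattern. The function names `mine_frequent_substrings` and `discriminating_patterns` and the parameters `min_support` and `max_other` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def mine_frequent_substrings(sequences, min_support):
    """Return {pattern: support} for contiguous patterns that occur in at
    least min_support sequences (support counts sequences, not occurrences).

    Illustrative assumption: patterns are plain substrings without gaps or
    wildcards, which is simpler than the pattern types named in the paper.
    """
    frequent = {}

    def grow(pattern, occurrences):
        # occurrences: (sequence index, position just past the current match)
        extensions = defaultdict(list)
        for seq_idx, end in occurrences:
            if end < len(sequences[seq_idx]):
                residue = sequences[seq_idx][end]
                extensions[residue].append((seq_idx, end + 1))
        for residue, occ in sorted(extensions.items()):
            support = len({seq_idx for seq_idx, _ in occ})
            # Only frequent patterns are grown further: an infrequent
            # pattern cannot have a frequent extension.
            if support >= min_support:
                longer = pattern + residue
                frequent[longer] = support
                grow(longer, occ)

    # The empty pattern "matches" at every position of every sequence.
    start = [(i, p) for i, seq in enumerate(sequences) for p in range(len(seq))]
    grow("", start)
    return frequent


def discriminating_patterns(positives, negatives, min_support, max_other):
    """Patterns frequent in `positives` that occur in at most `max_other`
    sequences of `negatives`. Returns {pattern: (support, other_support)}."""
    result = {}
    for pattern, support in mine_frequent_substrings(positives, min_support).items():
        other = sum(pattern in seq for seq in negatives)
        if other <= max_other:
            result[pattern] = (support, other)
    return result


if __name__ == "__main__":
    toy = ["MKVLA", "AKVLG", "TTKVL"]  # made-up sequences, not real proteins
    print(mine_frequent_substrings(toy, min_support=3))
    print(discriminating_patterns(["MKVLA", "AKVLG"], ["TTKVA", "GGKVA"], 2, 0))
```

The key design point is that each pattern carries the list of positions where it occurs, so extending it by one residue only inspects those positions instead of rescanning every sequence, and no alignment is ever computed.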

AVAILABILITY

The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
