An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences.

Authors

Ye Kai, Kosters Walter A, Ijzerman Adriaan P

Affiliations

Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research and Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands.

Publication information

Bioinformatics. 2007 Mar 15;23(6):687-93. doi: 10.1093/bioinformatics/btl665. Epub 2007 Jan 19.

DOI:10.1093/bioinformatics/btl665

PMID:17237070

Abstract

MOTIVATION

Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), identify frequent patterns directly from unaligned biological sequences without attempting to align them. Here we propose a new algorithm that is more efficient and offers more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets.

RESULTS

In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets.
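
The general idea of pattern growth on unaligned sequences can be illustrated with a small sketch. This is not the authors' implementation, and it does not model their type I, II or III patterns, which the abstract does not define. It assumes the simplest case, contiguous residue patterns, and a support threshold counted in distinct sequences; the function name `mine_frequent_substrings` and its parameters are hypothetical.

```python
from collections import defaultdict


def mine_frequent_substrings(sequences, min_support, max_length=10):
    """Return {pattern: support} for contiguous patterns found in at least
    `min_support` distinct sequences (hypothetical helper, illustration only)."""
    results = {}

    def grow(pattern, occurrences):
        # occurrences: (sequence index, position just after the current match)
        extensions = defaultdict(list)
        for seq_id, end in occurrences:
            if end < len(sequences[seq_id]):
                residue = sequences[seq_id][end]
                extensions[residue].append((seq_id, end + 1))
        for residue, occ in extensions.items():
            # Support is the number of distinct sequences with a match.
            support = len({seq_id for seq_id, _ in occ})
            if support >= min_support:
                grown = pattern + residue
                results[grown] = support
                # Only frequent patterns are extended; an infrequent pattern
                # cannot have a frequent extension, so pruning loses nothing.
                if len(grown) < max_length:
                    grow(grown, occ)

    # The empty pattern matches at every position of every sequence.
    start = [(i, j) for i, seq in enumerate(sequences) for j in range(len(seq))]
    grow("", start)
    return results


if __name__ == "__main__":
    toy = ["MKTAYIAK", "GGKTAYLL", "PPKTAQ"]
    for pattern, support in sorted(mine_frequent_substrings(toy, 3).items()):
        print(pattern, support)
```

On the three toy sequences with a minimum support of 3, the sketch reports K, T, A, KT, TA and KTA. Each pattern is extended only at the positions where its shorter prefix already matched, so the sequences are never aligned and never rescanned in full.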

AVAILABILITY

The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.

Similar articles

Mining sequential patterns for protein fold recognition.

J Biomed Inform. 2008 Feb;41(1):165-79. doi: 10.1016/j.jbi.2007.05.004. Epub 2007 May 17.

Bioinformatics. 2005 Sep 1;21 Suppl 2:ii42-6. doi: 10.1093/bioinformatics/bti1107.

Identification of putative domain linkers by a neural network - application to a large sequence database.

BMC Bioinformatics. 2006 Jun 27;7:323. doi: 10.1186/1471-2105-7-323.

Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization.

Biochem Biophys Res Commun. 2006 Aug 18;347(1):150-7. doi: 10.1016/j.bbrc.2006.06.059. Epub 2006 Jun 21.

SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection.

Bioinformatics. 2008 Mar 15;24(6):783-90. doi: 10.1093/bioinformatics/btn028. Epub 2008 Feb 1.

transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.

BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.

J Biomed Inform. 2008 Feb;41(1):65-81. doi: 10.1016/j.jbi.2007.05.010. Epub 2007 Jun 27.

The 3of5 web application for complex and comprehensive pattern matching in protein sequences.

BMC Bioinformatics. 2006 Mar 16;7:144. doi: 10.1186/1471-2105-7-144.

Mining frequent stem patterns from unaligned RNA sequences.

Bioinformatics. 2006 Oct 15;22(20):2480-7. doi: 10.1093/bioinformatics/btl431. Epub 2006 Aug 14.

Cited by

Mako: A Graph-based Pattern Growth Approach to Detect Complex Structural Variants.

Genomics Proteomics Bioinformatics. 2022 Feb;20(1):205-218. doi: 10.1016/j.gpb.2021.03.007. Epub 2021 Jul 3.

PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction.

Genes (Basel). 2019 Jan 22;10(2):73. doi: 10.3390/genes10020073.

Using machine learning tools for protein database biocuration assistance.

Sci Rep. 2018 Jul 5;8(1):10148. doi: 10.1038/s41598-018-28330-z.

Systematic discovery of complex insertions and deletions in human cancers.

Nat Med. 2016 Jan;22(1):97-104. doi: 10.1038/nm.4002. Epub 2015 Dec 14.

Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors.

BMC Bioinformatics. 2015 Sep 29;16:314. doi: 10.1186/s12859-015-0731-9.

Expanding the computational toolbox for mining cancer genomes.

Nat Rev Genet. 2014 Aug;15(8):556-70. doi: 10.1038/nrg3767. Epub 2014 Jul 8.

PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data.

Bioinformatics. 2012 Feb 15;28(4):479-86. doi: 10.1093/bioinformatics/btr712. Epub 2012 Jan 4.

Analysis of next-generation genomic data in cancer: accomplishments and challenges.

Hum Mol Genet. 2010 Oct 15;19(R2):R188-96. doi: 10.1093/hmg/ddq391. Epub 2010 Sep 15.

Breaking the computational barrier: a divide-conquer and aggregate based approach for Alu insertion site characterisation.

Int J Comput Biol Drug Des. 2009;2(4):302-22. doi: 10.1504/IJCBDD.2009.030763. Epub 2009 Jan 4.

Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.

Bioinformatics. 2009 Nov 1;25(21):2865-71. doi: 10.1093/bioinformatics/btp394. Epub 2009 Jun 26.
