Fast model-based protein homology detection without alignment.

Authors

Hochreiter Sepp, Heusel Martin, Obermayer Klaus

Affiliation

Institute of Bioinformatics, Johannes Kepler Universität Linz, 4040 Linz, Austria.

Publication

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

Abstract

MOTIVATION

As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, the genes are first identified and translated into amino acid sequences, which are then classified into structural or functional classes. The best-performing protein classification methods are based on protein homology detection using sequence alignment. Alignment methods have recently been enhanced by discriminative methods such as support vector machines (SVMs) and by position-specific scoring matrices (PSSMs) obtained from PSI-BLAST. However, alignment methods are time consuming if a new sequence must be compared to many known sequences, and the same holds for SVMs. Constructing a PSSM for the new sequence is even more time consuming. The best-performing methods would take about 25 days on present-day computers to classify the sequences of a new genome (20,000 genes) as belonging to just one specific class, yet there are hundreds of classes. Another shortcoming of alignment algorithms is that they do not build a model of the positive class but instead measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models (HMMs) are the only popular classification methods that build a model of the positive class, but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members, yielding new insights into protein function and structure.

We propose a fast model-based recurrent neural network for protein homology detection, the 'Long Short-Term Memory' (LSTM). LSTM automatically extracts indicative patterns for the positive class, but in contrast to profile methods it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics such as hydrophobicity, polarity, volume and polarizability, and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches, as it does not use predefined similarity measures like BLOSUM or PAM matrices.
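To illustrate what classification without alignment means in practice, the following is a minimal sketch of an LSTM scanning a single amino acid sequence and emitting a class score. It is not the authors' implementation: it assumes a textbook LSTM cell with a forget gate and no bias terms, a one-hot residue encoding, and random untrained weights, and all function names (`one_hot`, `init_params`, `classify`) are hypothetical. The point it shows is that scoring a sequence takes one pass over its residues, with no comparison against a database of known sequences.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector (all zeros if unknown)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(n_hidden, n_input=len(AMINO_ACIDS), seed=0):
    """Random, untrained weights; a real model would learn these from labelled data."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # One weight matrix per gate, acting on [input, previous hidden state].
    params = {g: matrix(n_hidden, n_input + n_hidden) for g in ("i", "f", "o", "g")}
    params["out"] = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return params


def affine(weights, vector):
    """Matrix-vector product (bias terms are omitted in this sketch)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def classify(sequence, params):
    """Scan the sequence once and return a score in (0, 1) for the positive class."""
    n_hidden = len(params["out"])
    h = [0.0] * n_hidden  # hidden state
    c = [0.0] * n_hidden  # memory cell state
    for residue in sequence:
        z = one_hot(residue) + h  # concatenate input and previous hidden state
        i = [sigmoid(a) for a in affine(params["i"], z)]    # input gate
        f = [sigmoid(a) for a in affine(params["f"], z)]    # forget gate
        o = [sigmoid(a) for a in affine(params["o"], z)]    # output gate
        g = [math.tanh(a) for a in affine(params["g"], z)]  # candidate update
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return sigmoid(sum(w * hj for w, hj in zip(params["out"], h)))


if __name__ == "__main__":
    params = init_params(n_hidden=8)
    print(classify("MKTAYIAKQRQISFVKSHFSRQ", params))
```

With untrained weights the score carries no meaning; the sketch only makes the cost structure visible, which is linear in sequence length and independent of how many known sequences exist.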

RESULTS

We applied LSTM to a well-known benchmark for remote protein homology detection, in which a protein must be classified as belonging to a SCOP superfamily. LSTM reaches state-of-the-art classification performance but is considerably faster at classification than other approaches with comparable performance. LSTM is five orders of magnitude faster than methods that perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). Only PSI-BLAST and HMM-based methods show time complexity comparable to LSTM, but they cannot compete with LSTM in classification performance. To test the modeling capabilities of LSTM, we applied it to PROSITE classes and interpreted the extracted patterns. In 8 of 15 classes, LSTM automatically extracted the PROSITE motif. In the remaining 7 cases, alternative motifs were generated which give better classification results on average than the PROSITE motifs.
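The timing figures quoted in the abstract can be put side by side with a short back-of-the-envelope calculation. This sketch assumes that the 25-day estimate and the "five orders of magnitude" comparison refer to the same group of methods, which the abstract does not state explicitly, and it uses 100 classes as a hypothetical stand-in for "hundreds of classes". The function name `timing_estimate` is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def timing_estimate(days_per_class=25, genes=20_000, speedup=10**5, n_classes=100):
    """Return per-gene and per-genome timings implied by the quoted figures.

    n_classes=100 is a hypothetical stand-in for "hundreds of classes".
    """
    slow_total_s = days_per_class * SECONDS_PER_DAY  # one class, whole genome
    fast_total_s = slow_total_s / speedup            # same task, 10^5 times faster
    return {
        "slow_seconds_per_gene": slow_total_s / genes,
        "fast_seconds_per_class": fast_total_s,
        "slow_days_all_classes": days_per_class * n_classes,
        "fast_minutes_all_classes": fast_total_s * n_classes / 60,
    }


if __name__ == "__main__":
    for name, value in timing_estimate().items():
        print(f"{name}: {value:g}")
```

Under these assumptions, 25 days for 20,000 genes works out to 108 seconds per sequence for one class, and a speedup of five orders of magnitude brings the whole genome down to roughly 22 seconds per class, or about 36 minutes for 100 classes instead of 2,500 days.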

AVAILABILITY

The LSTM algorithm is available from http://www.bioinf.jku.at/software/LSTM_protein/.

