Fast model-based protein homology detection without alignment.

Hochreiter Sepp, Heusel Martin, Obermayer Klaus

Institute of Bioinformatics, Johannes Kepler Universität Linz, 4040 Linz, Austria.

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

MOTIVATION

As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, the genes are first identified and translated into amino acid sequences, which are then classified into structural or functional classes. The best-performing protein classification methods are based on protein homology detection using sequence alignment methods. Alignment methods have recently been enhanced by discriminative methods like support vector machines (SVMs) as well as by position-specific scoring matrices (PSSMs) as obtained from PSI-BLAST. However, alignment methods are time consuming if a new sequence must be compared to many known sequences, and the same holds for SVMs. Constructing a PSSM for the new sequence is even more time consuming. The best-performing methods would take about 25 days on present-day computers to classify the sequences of a new genome (20,000 genes) as belonging to just one specific class, and there are hundreds of classes. Another shortcoming of alignment algorithms is that they do not build a model of the positive class but measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only popular classification methods that build a model of the positive class, but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members to obtain new insights into protein function and structure. We propose a fast model-based recurrent neural network for protein homology detection, the 'Long Short-Term Memory' (LSTM). LSTM automatically extracts indicative patterns for the positive class, but in contrast to profile methods it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics like hydrophobicity, polarity, volume, and polarizability, and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches, as it does not use predefined similarity measures like BLOSUM or PAM matrices.
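To make the idea of alignment-free classification concrete, here is a minimal sketch of how a recurrent network can read an amino acid sequence one residue at a time and emit a single class score. It is not the authors' implementation: the class `TinyLSTM`, the helper names, the hidden size, and the random untrained weights are all assumptions made for illustration, and the gate layout follows the common input/forget/output formulation, which may differ from the variant used in the paper.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector.

    Residues outside the standard alphabet map to the all-zero vector.
    """
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class TinyLSTM:
    """Hypothetical single-layer LSTM with random weights (illustration only)."""

    def __init__(self, n_hidden=4, seed=0):
        rng = random.Random(seed)
        # Each gate sees the current residue, the previous hidden state, and a bias.
        n_in = len(AMINO_ACIDS) + n_hidden + 1
        self.n_hidden = n_hidden
        # Gates: i = input, f = forget, o = output, c = candidate cell update.
        self.w = {gate: random_matrix(n_hidden, n_in, rng) for gate in "ifoc"}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]

    def score(self, sequence):
        """Return a value in (0, 1); one pass over the sequence, no alignment."""
        h = [0.0] * self.n_hidden  # hidden state
        c = [0.0] * self.n_hidden  # memory cell state
        for residue in sequence:
            x = one_hot(residue) + h + [1.0]
            i = [sigmoid(a) for a in matvec(self.w["i"], x)]
            f = [sigmoid(a) for a in matvec(self.w["f"], x)]
            o = [sigmoid(a) for a in matvec(self.w["o"], x)]
            g = [math.tanh(a) for a in matvec(self.w["c"], x)]
            # The cell keeps part of its old content and adds gated new content.
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return sigmoid(sum(w * hk for w, hk in zip(self.w_out, h)))


model = TinyLSTM()
print(model.score("MKTAYIAKQRQISFVKSHFSRQ"))  # arbitrary example sequence
```

The point of the sketch is the cost profile: scoring a sequence takes one pass whose work grows linearly with sequence length, and the new sequence is never compared against a database of known sequences. With trained weights, the same loop would perform the classification described above.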

RESULTS

We have applied LSTM to a well-known benchmark for remote protein homology detection, where a protein must be classified as belonging to a SCOP superfamily. LSTM reaches state-of-the-art classification performance but is considerably faster for classification than other approaches with comparable classification performance. LSTM is five orders of magnitude faster than methods which perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). Only PSI-BLAST and HMM-based methods show time complexity comparable to that of LSTM, but they cannot compete with LSTM in classification performance. To test the modeling capabilities of LSTM, we applied LSTM to PROSITE classes and interpreted the extracted patterns. In 8 out of 15 classes, LSTM automatically extracted the PROSITE motif. In the remaining 7 classes, LSTM generated alternative motifs which give better classification results on average than the PROSITE motifs.
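As a rough back-of-the-envelope check on what these speedups mean in practice, the figures quoted in the abstract can be combined as below. This is only a sketch under stated assumptions: it assumes the methods that need "about 25 days" per class are the same ones LSTM is "five orders of magnitude" faster than, it takes that factor as exactly 10^5, and it uses a hypothetical 100 classes in place of "hundreds of classes". None of the derived numbers appear in the source.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_sequence_seconds(total_days, n_sequences):
    """Average time spent per sequence when a whole genome takes total_days."""
    return total_days * SECONDS_PER_DAY / n_sequences


baseline_days = 25   # from the abstract: one genome, one class
n_genes = 20_000     # from the abstract
speedup = 10 ** 5    # assumption: "five orders of magnitude" taken as exactly 1e5
n_classes = 100      # hypothetical stand-in for "hundreds of classes"

baseline_per_seq = per_sequence_seconds(baseline_days, n_genes)
fast_per_seq = baseline_per_seq / speedup

baseline_all_days = baseline_days * n_classes
fast_all_minutes = baseline_days * SECONDS_PER_DAY * n_classes / speedup / 60

print(f"baseline: {baseline_per_seq:.0f} s per sequence per class")
print(f"1e5 times faster: {fast_per_seq * 1000:.2f} ms per sequence per class")
print(f"{n_classes} classes, baseline: {baseline_all_days} days")
print(f"{n_classes} classes, 1e5 times faster: {fast_all_minutes:.0f} minutes")
```

Under these assumptions, the baseline spends on the order of 100 seconds per sequence per class, while a method 10^5 times faster would handle a full genome across 100 classes in well under an hour.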

AVAILABILITY

The LSTM algorithm is available from http://www.bioinf.jku.at/software/LSTM_protein/.

Similar articles

HMM-ModE--improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences.

BMC Bioinformatics. 2007 Mar 27;8:104. doi: 10.1186/1471-2105-8-104.

SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection.

Bioinformatics. 2008 Mar 15;24(6):783-90. doi: 10.1093/bioinformatics/btn028. Epub 2008 Feb 1.

AutoSCOP: automated prediction of SCOP classifications using unique pattern-class mappings.

Bioinformatics. 2007 May 15;23(10):1203-10. doi: 10.1093/bioinformatics/btm089. Epub 2007 Mar 22.

Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure.

J Mol Biol. 2001 Nov 2;313(4):903-19. doi: 10.1006/jmbi.2001.5080.

Profile-based direct kernels for remote homology detection and fold recognition.

Bioinformatics. 2005 Dec 1;21(23):4239-47. doi: 10.1093/bioinformatics/bti687. Epub 2005 Sep 27.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Incremental window-based protein sequence alignment algorithms.

Bioinformatics. 2007 Jan 15;23(2):e17-23. doi: 10.1093/bioinformatics/btl297.

transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.

BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.

Prediction of protein subcellular localization.

Proteins. 2006 Aug 15;64(3):643-51. doi: 10.1002/prot.21018.

Cited by

Major advances in protein function assignment by remote homolog detection with protein language models - A review.

Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27.

Deep learning for optical tweezers.

Nanophotonics. 2024 May 23;13(17):3017-3035. doi: 10.1515/nanoph-2024-0013. eCollection 2024 Jul.

A privacy-preserving approach for cloud-based protein fold recognition.

Patterns (N Y). 2024 Jul 19;5(9):101023. doi: 10.1016/j.patter.2024.101023. eCollection 2024 Sep 13.

Exploring protein natural diversity in environmental microbiomes with DeepMetagenome.

Cell Rep Methods. 2024 Nov 18;4(11):100896. doi: 10.1016/j.crmeth.2024.100896. Epub 2024 Nov 7.

Deep learning in structural bioinformatics: current applications and future perspectives.

Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae042.

Classification of DNA Sequence Based on a Non-gradient Algorithm: Pseudoinverse Learners.

Methods Mol Biol. 2024;2744:359-373. doi: 10.1007/978-1-0716-3581-0_23.

Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models.

Int J Mol Sci. 2023 Nov 1;24(21):15858. doi: 10.3390/ijms242115858.

Machine Learning Methods for Small Data Challenges in Molecular Science.

Chem Rev. 2023 Jul 12;123(13):8736-8780. doi: 10.1021/acs.chemrev.3c00189. Epub 2023 Jun 29.

Deep self-supervised learning for biosynthetic gene cluster detection and product classification.

PLoS Comput Biol. 2023 May 23;19(5):e1011162. doi: 10.1371/journal.pcbi.1011162. eCollection 2023 May.

Sensor technologies for quality control in engineered tissue manufacturing.

Biofabrication. 2022 Oct 27;15(1). doi: 10.1088/1758-5090/ac94a1.
