Fast model-based protein homology detection without alignment.

Authors

Hochreiter Sepp, Heusel Martin, Obermayer Klaus

Affiliation

Institute of Bioinformatics, Johannes Kepler Universität Linz, 4040 Linz, Austria.

Publication

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

DOI:10.1093/bioinformatics/btm247
PMID:17488755
Abstract

MOTIVATION

As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, the genes are first identified and translated into amino acid sequences, which are then classified into structural or functional classes. The best-performing protein classification methods are based on protein homology detection using sequence alignment methods. Alignment methods have recently been enhanced by discriminative methods like support vector machines (SVMs) as well as by position-specific scoring matrices (PSSMs) as obtained from PSI-BLAST. However, alignment methods are time consuming if a new sequence must be compared to many known sequences; the same holds for SVMs. Constructing a PSSM for the new sequence is even more time consuming. The best-performing methods would take about 25 days on present-day computers to classify the sequences of a new genome (20,000 genes) as belonging to just one specific class; however, there are hundreds of classes. Another shortcoming of alignment algorithms is that they do not build a model of the positive class but measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only popular classification methods that build a model of the positive class, but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members to obtain new insights into protein function and structure. We propose a fast model-based recurrent neural network for protein homology detection, the 'Long Short-Term Memory' (LSTM). LSTM automatically extracts indicative patterns for the positive class, but in contrast to profile methods it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics such as hydrophobicity, polarity, volume and polarizability, and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches, as it does not use predefined similarity measures like BLOSUM or PAM matrices.
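
To make the alignment-free idea concrete, the following is a minimal sketch of how a recurrent network can score a protein in a single pass over its residues, with no comparison against known sequences. It is not the authors' implementation: the class name `TinyLSTMClassifier`, the hidden size, the one-hot encoding and the gate layout (the common textbook LSTM cell with a forget gate) are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


def affine(weights, bias, vector):
    """Return weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


class TinyLSTMClassifier:
    """Hypothetical, untrained LSTM scorer (illustration only)."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)
        n_in = len(AMINO_ACIDS) + hidden_size
        self.hidden_size = hidden_size
        # One (weights, bias) pair per gate:
        # input (i), forget (f), output (o) and candidate (g).
        self.gates = {
            name: (random_matrix(hidden_size, n_in, rng), [0.0] * hidden_size)
            for name in ("i", "f", "o", "g")
        }
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
        self.b_out = 0.0

    def predict(self, sequence):
        """Return a score in (0, 1) after one left-to-right pass."""
        h = [0.0] * self.hidden_size  # hidden state
        c = [0.0] * self.hidden_size  # memory cell state
        for residue in sequence:
            z = one_hot(residue) + h  # current input joined with previous state
            i = [sigmoid(a) for a in affine(*self.gates["i"], z)]
            f = [sigmoid(a) for a in affine(*self.gates["f"], z)]
            o = [sigmoid(a) for a in affine(*self.gates["o"], z)]
            g = [math.tanh(a) for a in affine(*self.gates["g"], z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        logit = sum(w * hk for w, hk in zip(self.w_out, h)) + self.b_out
        return sigmoid(logit)


model = TinyLSTMClassifier()
score = model.predict("MKTAYIAKQR")  # hypothetical peptide, not from the paper
```

The work per sequence grows linearly with its length and does not depend on how many known sequences exist, which is the property the abstract contrasts with alignment-based and SVM-based classifiers.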

RESULTS

We have applied LSTM to a well-known benchmark for remote protein homology detection, where a protein must be classified as belonging to a SCOP superfamily. LSTM reaches state-of-the-art classification performance but is considerably faster for classification than other approaches with comparable classification performance. LSTM is five orders of magnitude faster than methods which perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). Only PSI-BLAST and HMM-based methods show time complexity comparable to that of LSTM, but they cannot compete with LSTM in classification performance. To test the modeling capabilities of LSTM, we applied LSTM to PROSITE classes and interpreted the extracted patterns. In 8 out of 15 classes, LSTM automatically extracted the PROSITE motif. In the remaining 7 cases, alternative motifs are generated which give better classification results on average than the PROSITE motifs.
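
The timing figures quoted in the abstract can be put into per-sequence terms with some simple arithmetic. The sketch below uses only the numbers stated there (25 days, 20,000 genes, five orders of magnitude). Two things are assumptions: "hundreds of classes" is taken as 100, and the 25-day figure is treated as describing the same methods as the five-orders-of-magnitude comparison, which the abstract does not state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_sequence(days, n_sequences):
    """Average wall-clock seconds spent on one sequence."""
    return days * SECONDS_PER_DAY / n_sequences


def days_for_all_classes(days_per_class, n_classes):
    """Total days if every class needs its own full pass over the genome."""
    return days_per_class * n_classes


def seconds_after_speedup(days, orders_of_magnitude):
    """Seconds for the same job after a speedup of 10**orders_of_magnitude."""
    return days * SECONDS_PER_DAY / 10 ** orders_of_magnitude


# Figures quoted in the abstract.
GENES = 20_000
DAYS_PER_CLASS = 25

# Assumption: "hundreds of classes" is taken as 100, a deliberately low value.
ASSUMED_CLASSES = 100

print(seconds_per_sequence(DAYS_PER_CLASS, GENES))            # 108.0
print(days_for_all_classes(DAYS_PER_CLASS, ASSUMED_CLASSES))  # 2500
print(seconds_after_speedup(DAYS_PER_CLASS, 5))               # 21.6
```

Under these assumptions, the slow methods spend about 108 seconds per sequence for a single class and roughly 2,500 days for 100 classes, while a method five orders of magnitude faster would finish one class for the whole genome in about 22 seconds.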

AVAILABILITY

The LSTM algorithm is available from http://www.bioinf.jku.at/software/LSTM_protein/.

Similar articles

1. Fast model-based protein homology detection without alignment.
   Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
2. HMM-ModE--improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences.
   BMC Bioinformatics. 2007 Mar 27;8:104. doi: 10.1186/1471-2105-8-104.
3. SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection.
   Bioinformatics. 2008 Mar 15;24(6):783-90. doi: 10.1093/bioinformatics/btn028. Epub 2008 Feb 1.
4. AutoSCOP: automated prediction of SCOP classifications using unique pattern-class mappings.
   Bioinformatics. 2007 May 15;23(10):1203-10. doi: 10.1093/bioinformatics/btm089. Epub 2007 Mar 22.
5. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure.
   J Mol Biol. 2001 Nov 2;313(4):903-19. doi: 10.1006/jmbi.2001.5080.
6. Profile-based direct kernels for remote homology detection and fold recognition.
   Bioinformatics. 2005 Dec 1;21(23):4239-47. doi: 10.1093/bioinformatics/bti687. Epub 2005 Sep 27.
7. On the quality of tree-based protein classification.
   Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
8. Incremental window-based protein sequence alignment algorithms.
   Bioinformatics. 2007 Jan 15;23(2):e17-23. doi: 10.1093/bioinformatics/btl297.
9. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
   BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.
10. Prediction of protein subcellular localization.
    Proteins. 2006 Aug 15;64(3):643-51. doi: 10.1002/prot.21018.

Cited by

1. Major advances in protein function assignment by remote homolog detection with protein language models - A review.
   Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27.
2. Deep learning for optical tweezers.
   Nanophotonics. 2024 May 23;13(17):3017-3035. doi: 10.1515/nanoph-2024-0013. eCollection 2024 Jul.
3. A privacy-preserving approach for cloud-based protein fold recognition.
   Patterns (N Y). 2024 Jul 19;5(9):101023. doi: 10.1016/j.patter.2024.101023. eCollection 2024 Sep 13.
4. Exploring protein natural diversity in environmental microbiomes with DeepMetagenome.
   Cell Rep Methods. 2024 Nov 18;4(11):100896. doi: 10.1016/j.crmeth.2024.100896. Epub 2024 Nov 7.
5. Deep learning in structural bioinformatics: current applications and future perspectives.
   Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae042.
6. Classification of DNA Sequence Based on a Non-gradient Algorithm: Pseudoinverse Learners.
   Methods Mol Biol. 2024;2744:359-373. doi: 10.1007/978-1-0716-3581-0_23.
7. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models.
   Int J Mol Sci. 2023 Nov 1;24(21):15858. doi: 10.3390/ijms242115858.
8. Machine Learning Methods for Small Data Challenges in Molecular Science.
   Chem Rev. 2023 Jul 12;123(13):8736-8780. doi: 10.1021/acs.chemrev.3c00189. Epub 2023 Jun 29.
9. Deep self-supervised learning for biosynthetic gene cluster detection and product classification.
   PLoS Comput Biol. 2023 May 23;19(5):e1011162. doi: 10.1371/journal.pcbi.1011162. eCollection 2023 May.
10. Sensor technologies for quality control in engineered tissue manufacturing.
    Biofabrication. 2022 Oct 27;15(1). doi: 10.1088/1758-5090/ac94a1.