DeepFam: deep learning based alignment-free method for protein family modeling and prediction.

Affiliations

Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea.

Publication information

Bioinformatics. 2018 Jul 1;34(13):i254-i262. doi: 10.1093/bioinformatics/bty275.

DOI: 10.1093/bioinformatics/bty275

PMID: 29949966

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6022622/

Abstract

MOTIVATION

Next-generation sequencing technologies generate a large number of newly sequenced proteins, and assigning biochemical functions to these proteins is an important task. However, biological experiments are too expensive to characterize such a large number of protein sequences, so protein function prediction is primarily done by computational modeling methods, such as profile Hidden Markov Models (pHMMs) and k-mer based methods. Nevertheless, existing methods have limitations: k-mer based methods are not accurate enough to assign protein functions, and pHMMs are not fast enough to handle the large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed.

RESULTS

In this paper, we introduce DeepFam, an alignment-free method that extracts functional information directly from sequences without the need for multiple sequence alignments. In extensive experiments using the Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) datasets, DeepFam achieved better accuracy and runtime for predicting protein functions than state-of-the-art methods, both alignment-free and alignment-based. Additionally, we showed that DeepFam is capable of capturing conserved regions to model protein families: it was able to detect conserved regions documented in the Prosite database while predicting functions of proteins. Our deep learning method will be useful in characterizing the functions of the ever-increasing number of protein sequences.
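The abstract does not spell out the architecture, so the following is only a toy sketch of the general idea behind alignment-free detection of conserved regions, assuming a convolution-and-max-pooling style of model. Each residue is one-hot encoded, a filter is slid along the sequence, and the maximum response is kept, so a conserved motif is found wherever it occurs and sequences of different lengths need no alignment. The filter here is built by hand from a made-up motif ("GKT") rather than learned, and all function names are hypothetical; this is not the DeepFam implementation.

```python
# Toy illustration only: a hand-built filter stands in for a learned one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:  # unknown residues (e.g. 'X') stay all-zero
            row[idx] = 1.0
        rows.append(row)
    return rows


def motif_filter(motif):
    """Build a toy filter that scores 1.0 per position matching `motif`."""
    return one_hot(motif)


def max_pooled_score(sequence, filt):
    """Slide `filt` along the sequence; return (best score, best start).

    Returns (-inf, -1) if the sequence is shorter than the filter.
    """
    encoded = one_hot(sequence)
    width = len(filt)
    best_score, best_start = float("-inf"), -1
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * filt[i][j]
            for i in range(width)
            for j in range(len(AMINO_ACIDS))
        )
        if score > best_score:  # keep the first position with the top score
            best_score, best_start = score, start
    return best_score, best_start


if __name__ == "__main__":
    filt = motif_filter("GKT")
    # Same motif, different lengths and offsets, no alignment required.
    for seq in ("MAGKTLV", "MSSPLAGKTWQ", "MAAAAAA"):
        print(seq, max_pooled_score(seq, filt))
```

In a trained model the filter weights would be learned from labeled family members, and many filters of several widths would feed a classifier; the max over positions is what makes the score independent of where the conserved region sits.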

AVAILABILITY AND IMPLEMENTATION

Code is available at https://bhi-kimlab.github.io/DeepFam.
