EnsembleFam: towards more accurate protein family prediction in the twilight zone.

Authors

Kabir Mohammad Neamul, Wong Limsoon

Affiliation

Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore.

Publication information

BMC Bioinformatics. 2022 Mar 14;23(1):90. doi: 10.1186/s12859-022-04626-w.

DOI:10.1186/s12859-022-04626-w

PMID:35287576

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8919565/

Abstract

BACKGROUND

Current protein family modeling methods, such as profile Hidden Markov Models (pHMMs), k-mer-based methods, and deep learning-based methods, do not predict function very accurately for proteins in the twilight zone, because these proteins have low sequence similarity to reference proteins with known functions.

RESULTS

We present a novel method, EnsembleFam, that aims at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. Using these features, EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments were conducted using the Clusters of Orthologous Groups (COG) dataset and a G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides much more accurate predictions for twilight zone proteins.
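The per-family ensemble idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: a k-mer Jaccard score stands in for a homology-based score, simple margin rules stand in for the three trained SVMs, and a majority vote stands in for the ensemble step, which the abstract does not specify. The names `features`, `make_margin_classifier`, and `ensemble_predict` are hypothetical.

```python
from typing import Callable, Sequence, Set, Tuple

Features = Tuple[float, float]  # (similarity to family, similarity to non-family)


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """k-mer Jaccard similarity; an assumed stand-in for a homology score."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def features(query: str, members: Sequence[str], non_members: Sequence[str]) -> Features:
    """Best similarity to family members and to non-members (the dissimilarity cue)."""
    sim = max((jaccard(query, m) for m in members), default=0.0)
    other = max((jaccard(query, n) for n in non_members), default=0.0)
    return sim, other


def make_margin_classifier(margin: float) -> Callable[[Features], bool]:
    """Stand-in for one trained SVM: accept if the family score wins by `margin`."""
    def classify(feat: Features) -> bool:
        sim, other = feat
        return sim - other > margin
    return classify


def ensemble_predict(feat: Features, classifiers: Sequence[Callable[[Features], bool]]) -> bool:
    """Majority vote over the per-family classifiers (assumed combination rule)."""
    votes = sum(1 for clf in classifiers if clf(feat))
    return votes * 2 > len(classifiers)


# Three classifiers for one hypothetical family, differing only in strictness.
family_classifiers = [make_margin_classifier(m) for m in (0.1, 0.3, 0.5)]
family_members = ["MKTAYIAKQR", "MKTAYLAKQR"]
outsiders = ["GGSPLLWWHH"]

is_member = ensemble_predict(
    features("MKTAYIAKQK", family_members, outsiders), family_classifiers
)
```

In this toy setup the query shares most of its 3-mers with a family member and none with the outsider, so all three stand-in classifiers vote for membership. A real pipeline would replace the Jaccard score with homology-derived scores and the margin rules with classifiers fitted on labeled family data.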

CONCLUSIONS

EnsembleFam, a machine learning method for modeling protein families, can be used to better identify family members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5152/8919565/288d27937669/12859_2022_4626_Fig1_HTML.jpg
