EnsembleFam: towards more accurate protein family prediction in the twilight zone.

Authors

Kabir Mohammad Neamul, Wong Limsoon

Affiliation

Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore.

Publication

BMC Bioinformatics. 2022 Mar 14;23(1):90. doi: 10.1186/s12859-022-04626-w.

Abstract

BACKGROUND

Current protein family modeling methods, such as profile Hidden Markov Models (pHMMs), k-mer-based methods, and deep learning-based methods, do not predict protein function very accurately for proteins in the twilight zone, because these proteins have low sequence similarity to reference proteins with known functions.

RESULTS

We present EnsembleFam, a novel method that aims at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. Using these features, it trains three separate Support Vector Machine (SVM) classifiers for each family and combines them into an ensemble prediction that classifies novel proteins into these families. Extensive experiments are conducted on the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall datasets but also provides much more accurate predictions for twilight zone proteins.
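
The per-family ensemble step can be illustrated with a minimal sketch. The abstract does not say how the three classifiers are combined, so the strict majority vote here is an assumption. The linear decision functions are hypothetical stand-ins for trained SVMs, and the feature values, weights, and family names are invented for illustration only.

```python
from typing import Callable, Dict, List, Sequence

# A binary classifier maps a feature vector to True (member) or False (non-member).
Classifier = Callable[[Sequence[float]], bool]


def linear_classifier(weights: Sequence[float], bias: float) -> Classifier:
    """Hypothetical stand-in for a trained SVM: sign of a linear decision function."""
    def predict(features: Sequence[float]) -> bool:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return score > 0.0
    return predict


def ensemble_predict(
    features: Sequence[float],
    family_models: Dict[str, List[Classifier]],
) -> List[str]:
    """Return the families for which a strict majority of classifiers vote 'member'.

    Majority voting is an assumed combination rule, not one confirmed by the abstract.
    """
    predicted = []
    for family, classifiers in family_models.items():
        votes = sum(1 for clf in classifiers if clf(features))
        if 2 * votes > len(classifiers):
            predicted.append(family)
    return sorted(predicted)


# Toy feature vector layout: [similarity to family_A, similarity to family_B].
# All weights and biases below are made up; real models would be learned from data.
family_models = {
    "family_A": [
        linear_classifier([1.0, -1.0], -0.1),
        linear_classifier([1.0, -0.5], -0.3),
        linear_classifier([0.8, -1.0], 0.0),
    ],
    "family_B": [
        linear_classifier([-1.0, 1.0], -0.1),
        linear_classifier([-0.5, 1.0], -0.3),
        linear_classifier([-1.0, 0.8], 0.0),
    ],
}

query = [0.9, 0.2]
print(ensemble_predict(query, family_models))  # prints ['family_A']
```

A query that no family claims by majority yields an empty list, which is one simple way to represent "no confident family assignment" in this sketch.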

CONCLUSIONS

EnsembleFam, a machine learning method for modeling protein families, can be used to better identify family members with very low sequence homology. With EnsembleFam, protein functions can be predicted from sequence information alone, with better accuracy than state-of-the-art methods.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5152/8919565/288d27937669/12859_2022_4626_Fig1_HTML.jpg