Kabir Mohammad Neamul, Wong Limsoon
Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore.
BMC Bioinformatics. 2022 Mar 14;23(1):90. doi: 10.1186/s12859-022-04626-w.
Current protein family modeling methods, such as profile Hidden Markov Models (pHMMs), k-mer-based methods, and deep learning-based methods, do not predict function accurately for proteins in the twilight zone, because these proteins have low sequence similarity to reference proteins with known functions.
We present EnsembleFam, a novel method aimed at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. For each family, EnsembleFam trains three separate Support Vector Machine (SVM) classifiers on these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments were conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides much more accurate predictions for twilight zone proteins.
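The ensemble step described above can be illustrated with a minimal sketch. The abstract does not say how the three per-family classifier outputs are combined, so the majority vote used here is an assumption, not the authors' stated rule. The names `majority_vote`, `predict_families`, and `threshold` are hypothetical, the two-element feature layout is invented for illustration, and the toy threshold functions only stand in for trained SVMs.

```python
from typing import Callable, Dict, List, Sequence

# Feature vector of one protein relative to one family (hypothetical layout).
Features = Sequence[float]
# Stand-in for a trained binary classifier: True means "member of the family".
Classifier = Callable[[Features], bool]


def majority_vote(classifiers: Sequence[Classifier], features: Features) -> bool:
    """Return True when more than half of the classifiers accept the protein."""
    votes = sum(1 for clf in classifiers if clf(features))
    return votes * 2 > len(classifiers)


def predict_families(
    family_models: Dict[str, Sequence[Classifier]],
    family_features: Dict[str, Features],
) -> List[str]:
    """List, in sorted order, every family whose ensemble accepts the protein."""
    return sorted(
        family
        for family, classifiers in family_models.items()
        if majority_vote(classifiers, family_features[family])
    )


def threshold(index: int, cutoff: float) -> Classifier:
    """Toy classifier (assumption, not an SVM): accept if features[index] >= cutoff."""
    return lambda features: features[index] >= cutoff


# Three toy classifiers per family, mirroring the three SVMs per family.
models = {
    "FAM_A": [threshold(0, 0.5), threshold(1, 0.5), threshold(0, 0.9)],
    "FAM_B": [threshold(0, 0.5), threshold(1, 0.5), threshold(0, 0.9)],
}
# Invented features: [similarity to members, dissimilarity to non-members].
features = {"FAM_A": [0.6, 0.7], "FAM_B": [0.2, 0.8]}

predicted = predict_families(models, features)
print(predicted)
```

In this toy run, two of the three classifiers accept the protein for FAM_A and only one accepts it for FAM_B, so only FAM_A is reported. In practice each callable would wrap a trained SVM's decision for that family.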
EnsembleFam, a machine learning method for modeling protein families, can be used to better identify family members with very low sequence homology. With EnsembleFam, protein functions can be predicted from sequence information alone, with better accuracy than state-of-the-art methods.