Improving the classification of nuclear receptors with feature selection.

Author information

Gao Qing-Bin, Jin Zhi-Chao, Ye Xiao-Fei, Wu Cheng, Lu Jian, He Jia

Affiliation

Department of Health Statistics, Second Military Medical University, Shanghai 200433, China.

Publication information

Protein Pept Lett. 2009;16(7):823-9. doi: 10.2174/092986609788681733.

Abstract

Nuclear receptors are involved in multiple cellular signaling pathways that affect and regulate cellular processes. Because of their physiological and pathophysiological significance, classifying nuclear receptors is essential for a proper understanding of their functions. Bhasin and Raghava have shown that the subfamilies of nuclear receptors are closely correlated with their amino acid composition and dipeptide composition [29]. They characterized each protein by a 400-dimensional feature vector. However, using high-dimensional feature vectors to characterize protein sequences increases both the computational cost and the risk of overfitting. Therefore, using only the features most relevant to the task might improve the prediction system and might also provide some biologically useful knowledge. In this paper, a feature selection approach is proposed to identify relevant features, and a support vector machine prediction engine is developed to estimate the classification accuracy obtained with the selected features. A reduced subset of 30 features, of which 18 are amino acid compositions and 12 are dipeptide compositions, was accepted to characterize the protein sequences in view of its good discriminative power between the classes. This reduced feature subset gave an overall accuracy of 98.9% in a 5-fold cross-validation test, higher than the 88.7% of the amino acid composition-based method and almost as high as the 99.3% of the dipeptide composition-based method. Moreover, it reached an overall accuracy of 93.7% when evaluated on a blind data set of 63 nuclear receptors. In addition, overall accuracies of 96.1% and 95.2% were obtained with the 12 selected dipeptide compositions in the 5-fold cross-validation test and the blind data set test, respectively. These results demonstrate the effectiveness of the present method.
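
The two sequence representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the use of fractions rather than percentages, and the choice to skip pairs containing non-standard residues are all assumptions made for illustration. Amino acid composition yields 20 values per protein, and dipeptide composition yields 20 x 20 = 400 values, which matches the 400-dimensional vector mentioned above.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so that feature
# positions are comparable across proteins.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return 20 fractions: the share of each standard residue.

    Assumption: non-standard characters (e.g. 'X') are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    total = len(residues)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 fractions: the share of each adjacent residue pair.

    Assumption: a pair is skipped if either residue is non-standard.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


if __name__ == "__main__":
    toy = "ACAC"  # toy sequence, not a real nuclear receptor
    aac = amino_acid_composition(toy)
    dpc = dipeptide_composition(toy)
    print(len(aac), len(dpc))
```

A feature selection step like the one the abstract describes would then keep only a subset of these 420 values (30 in the paper) before training the classifier; the selection criterion itself is not given in the abstract and is not sketched here.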
