Wang Hongchu, Hu Xuehai
Department of Mathematics, South China Normal University, Guangzhou, 510631, P.R. of China.
College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, P.R. of China.
BMC Bioinformatics. 2015 Dec 3;16:402. doi: 10.1186/s12859-015-0828-1.
Nuclear receptors (NRs) form a large family of ligand-inducible transcription factors that regulate the expression of genes involved in numerous physiological processes, such as embryogenesis, homeostasis, and cell growth and death. NR-related pathways are important targets of marketed drugs. Designing a reliable computational model that predicts NRs from amino acid sequence has therefore become a significant biomedical problem.
The conjoint triad feature (CTF) captures neighbor relationships in a protein sequence by encoding the sequence as the frequency distribution of triads (three consecutive amino acids) over a 7-letter reduced alphabet. In addition, chaos game representation (CGR) can expose patterns hidden in protein sequences and visually reveal previously unknown structure. In this paper, three methods, CTF, CGR, and amino acid composition (AAC), are used to represent the protein samples. Taking the different combinations of the three methods gives seven feature groups, each of which is evaluated by 10-fold cross-validation. A new non-redundant dataset containing 474 NR sequences and 500 non-NR sequences is also built from the latest NucleaRDB database. In the numerical experiments, the feature group combining CTF and AAC performs best, with an accuracy of 96.30% for distinguishing NRs from non-NRs. A protein classified as an NR is then passed to a second level, which assigns it to one of the eight main NR subfamilies. At the second level, the combined CTF and AAC features again give the best accuracy, 94.73%. The proposed predictor is subsequently compared with two existing methods; with the CTF-based method, accuracy increases to 98.79% at the first level (NR-2L: 92.56%; iNR-PhysChem: 98.18%) and to 93.71% at the second level (NR-2L: 88.68%; iNR-PhysChem: 92.45%). Finally, each component of the CTF features is analyzed with a statistical significance test, and a simplified model using only the resulting top-50 significant features achieves an accuracy of 95.28%.
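As an illustration of the encoding described above, the following is a minimal Python sketch, not the authors' implementation. The seven residue classes in `CLASSES` are an assumption (a grouping commonly used with conjoint triads; the abstract does not list them), as are the plain frequency normalization and the function names `ctf`, `aac`, `ctf_aac`, and `feature_groups`.

```python
from collections import Counter
from itertools import combinations, product

# ASSUMPTION: 7-class reduced alphabet; the abstract does not state the classes.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(CLASSES) for aa in group}
TRIADS = ["".join(t) for t in product("0123456", repeat=3)]  # 7^3 = 343 triads
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ctf(seq):
    """343-dim vector of triad frequencies over the reduced alphabet."""
    # Non-standard residues (e.g. X, B) are dropped before triads are formed.
    reduced = [AA_TO_CLASS[aa] for aa in seq.upper() if aa in AA_TO_CLASS]
    # Slide a window of three consecutive reduced letters along the sequence.
    counts = Counter("".join(reduced[i:i + 3]) for i in range(len(reduced) - 2))
    total = sum(counts.values())
    # ASSUMPTION: plain relative frequency; the paper may normalize differently.
    return [counts[t] / total if total else 0.0 for t in TRIADS]


def aac(seq):
    """20-dim amino acid composition (fraction of each standard residue)."""
    residues = [aa for aa in seq.upper() if aa in AA_TO_CLASS]
    n = len(residues)
    return [residues.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def ctf_aac(seq):
    """Concatenated CTF + AAC features (343 + 20 = 363 dimensions)."""
    return ctf(seq) + aac(seq)


def feature_groups(names=("CTF", "CGR", "AAC")):
    """All non-empty combinations of the three methods: 2^3 - 1 = 7 groups."""
    return [c for r in range(1, len(names) + 1) for c in combinations(names, r)]
```

For example, `ctf_aac("MKVLAAGIC")` returns a 363-element list that could be passed to any classifier, and `feature_groups()` enumerates the seven feature groups mentioned above. CGR is left out of the sketch because the abstract does not describe how it is turned into features.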
The experimental results demonstrate that our CTF-based method is effective for predicting nuclear receptor proteins. Furthermore, based on the analysis of relative importance, the top-50 significant features obtained from the statistical significance test are regarded as the "intrinsic features" for predicting NRs.