基于蛋白质序列的单链和双链DNA结合蛋白分析与预测

Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences.

作者信息

Wang Wei, Sun Lin, Zhang Shiguang, Zhang Hongjun, Shi Jinling, Xu Tianhe, Li Keliang

机构信息

College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan Province, 453007, China.

Laboratory of Computation Intelligence and Information Processing, Engineering Technology Research Center for Computing Intelligence and Data Mining, Xinxiang, Henan Province, 453007, China.

出版信息

BMC Bioinformatics. 2017 Jun 12;18(1):300. doi: 10.1186/s12859-017-1715-8.

DOI:10.1186/s12859-017-1715-8

PMID:28606086

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5469069/

Abstract

BACKGROUND

DNA-binding proteins perform important functions in a great number of biological activities. DNA-binding proteins can interact with ssDNA (single-stranded DNA) or dsDNA (double-stranded DNA), and DNA-binding proteins can be categorized as single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). The identification of DNA-binding proteins from amino acid sequences can help to annotate protein functions and understand the binding specificity. In this study, we systematically consider a variety of schemes to represent protein sequences: OAAC (overall amino acid composition) features, dipeptide compositions, PSSM (position-specific scoring matrix profiles) and split amino acid composition (SAA), and then we adopt SVM (support vector machine) and RF (random forest) classification model to distinguish SSBs from DSBs.

RESULTS

Our results suggest that some sequence features can significantly differentiate DSBs and SSBs. Evaluated by 10 fold cross-validation on the benchmark datasets, our prediction method can achieve the accuracy of 88.7% and AUC (area under the curve) of 0.919. Moreover, our method has good performance in independent testing.

CONCLUSIONS

Using various sequence-derived features, a novel method is proposed to distinguish DSBs and SSBs accurately. The method also explores novel features, which could be helpful to discover the binding specificity of DNA-binding proteins.

摘要

背景

DNA结合蛋白在大量生物活动中发挥着重要作用。DNA结合蛋白可与单链DNA（ssDNA）或双链DNA（dsDNA）相互作用，且DNA结合蛋白可分为单链DNA结合蛋白（SSB）和双链DNA结合蛋白（DSB）。从氨基酸序列中识别DNA结合蛋白有助于注释蛋白质功能并理解结合特异性。在本研究中，我们系统地考虑了多种表示蛋白质序列的方案：整体氨基酸组成（OAAC）特征、二肽组成、位置特异性评分矩阵谱（PSSM）和分割氨基酸组成（SAA），然后采用支持向量机（SVM）和随机森林（RF）分类模型来区分SSB和DSB。

结果

我们的结果表明，一些序列特征能够显著区分DSB和SSB。在基准数据集上通过10折交叉验证进行评估，我们的预测方法可达到88.7%的准确率和0.919的曲线下面积（AUC）。此外，我们的方法在独立测试中表现良好。

结论

利用各种源自序列的特征，提出了一种准确区分DSB和SSB的新方法。该方法还探索了新的特征，这可能有助于发现DNA结合蛋白的结合特异性。

Suppr 超能文献

文献检索

文件翻译

深度研究

Suppr 超能文献

文献检索

文件翻译

深度研究

基于蛋白质序列的单链和双链DNA结合蛋白分析与预测

Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences.

作者信息

机构信息

出版信息

BACKGROUND

RESULTS

CONCLUSIONS

背景

结果

结论

相似文献

引用本文的文献

本文引用的文献

基于蛋白质序列的单链和双链DNA结合蛋白分析与预测

Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences.

作者信息

机构信息

出版信息

BACKGROUND

RESULTS

CONCLUSIONS

背景

结果

结论

相似文献

引用本文的文献

本文引用的文献