DNABind: a hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches.

Author information

Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina, 29208; Center for Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, People's Republic of China.

Publication information

Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.

DOI:10.1002/prot.24330

PMID:23737141

Abstract

Accurate prediction of DNA-binding residues has become a problem of increasing importance in structural bioinformatics. Here, we present DNABind, a novel hybrid algorithm for identifying these crucial residues by exploiting the complementarity between machine learning- and template-based methods. Our machine learning-based method is a probabilistic combination of a structure-based and a sequence-based predictor, both implemented using support vector machines. The former uses structural features designed for this task, such as solvent accessibility, local geometry, topological features, and relative positions, which effectively quantify the difference between DNA-binding and nonbinding residues. The latter combines evolutionary conservation features with three other sequence attributes. Our template-based method relies on structural alignment and uses template structures from known protein-DNA complexes to infer DNA-binding residues. We show that the template method performs very well when reliable templates are found for the query protein, but is strongly influenced by template quality and by conformational changes upon DNA binding. In contrast, the machine learning approach performs better when high-quality templates are not available (about one-third of the cases in our dataset) or when the query protein undergoes substantial conformational changes upon DNA binding. Our extensive experiments indicate that the hybrid approach clearly improves on the performance of the individual methods for both bound and unbound structures. DNABind also significantly outperforms state-of-the-art algorithms by around 10% in terms of the Matthews correlation coefficient. The proposed methodology could also find wide application in the annotation of other protein functional sites. DNABind is freely available at http://mleg.cse.sc.edu/DNABind/.
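
The abstract does not spell out how the two SVM outputs are combined or how the hybrid chooses between the template and machine learning predictions, so the following is only a minimal sketch under stated assumptions. The weighted average in `combine_probabilities`, the template-score cutoff in `hybrid_predict`, and all function and parameter names are hypothetical illustrations, not DNABind's actual procedure. Only the Matthews correlation coefficient follows its standard definition.

```python
import math


def combine_probabilities(p_struct, p_seq, weight=0.5):
    """Assumed combination rule: weighted average of the structure-based
    and sequence-based probabilities for one residue (hypothetical)."""
    return weight * p_struct + (1.0 - weight) * p_seq


def hybrid_predict(ml_probs, template_labels, template_score,
                   score_cutoff=0.5, prob_cutoff=0.5):
    """Assumed hybrid rule: trust the template-derived labels when a
    template exists and its alignment score reaches the cutoff;
    otherwise threshold the machine learning probabilities."""
    if template_labels is not None and template_score >= score_cutoff:
        return list(template_labels)
    return [1 if p >= prob_cutoff else 0 for p in ml_probs]


def matthews_corrcoef(y_true, y_pred):
    """Standard MCC for binary labels (1 = DNA-binding, 0 = nonbinding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy per-residue probabilities; the values are made up for illustration.
    ml = [combine_probabilities(s, q)
          for s, q in [(0.9, 0.7), (0.2, 0.4), (0.6, 0.8), (0.1, 0.3)]]
    pred = hybrid_predict(ml, template_labels=None, template_score=0.0)
    print(pred, matthews_corrcoef([1, 0, 1, 0], pred))
```

In this sketch a poor or missing template falls back to the machine learning prediction, mirroring the complementarity the abstract describes; a real system would need a calibrated template-quality measure in place of the assumed `template_score`.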
