Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Bioinformatics. 2018 Sep 1;34(17):i773-i780. doi: 10.1093/bioinformatics/bty585.
Given a protein of unknown function, fast identification of similar protein structures from the Protein Data Bank (PDB) is a critical step for inferring its biological function. Such structural neighbors can provide evolutionary insights into protein conformation, interfaces and binding sites that are not detectable from sequence similarity. However, performing pairwise structural alignment against all structures in the PDB is computationally prohibitive. Alignment-free approaches have been introduced to enable fast but coarse comparisons: each protein is represented as a vector of structural features, or fingerprint, and similarity is computed only between vectors. As a notable example, FragBag represents each protein by a 'bag of fragments', which is a vector of frequencies of contiguous short backbone fragments from a predetermined library. Although FragBag is efficient, its accuracy is unsatisfactory because its backbone fragment library may not be optimally constructed and long-range interacting patterns are omitted.
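To make the 'bag of fragments' idea concrete, the following is a minimal sketch, not FragBag's actual implementation. It assumes each protein is given as an array of Cα coordinates, that every contiguous backbone window is assigned to the library fragment with the lowest RMSD after optimal superposition, and that fingerprints are compared by cosine similarity. The function names and the random toy library are hypothetical and serve only as illustration.

```python
import numpy as np

def kabsch_rmsd(a, b):
    """RMSD between two (L, 3) coordinate sets after optimal superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Singular values of the covariance matrix give the best achievable overlap.
    u, s, vt = np.linalg.svd(a.T @ b)
    # If the best orthogonal map is a reflection, flip the smallest singular value.
    if np.linalg.det(u @ vt) < 0:
        s[-1] = -s[-1]
    msd = (np.sum(a**2) + np.sum(b**2) - 2.0 * s.sum()) / len(a)
    return np.sqrt(max(msd, 0.0))

def bag_of_fragments(coords, library):
    """Count how often each library fragment is the nearest match to a backbone window.

    coords:  (N, 3) backbone coordinates (assumed here to be C-alpha atoms).
    library: (K, L, 3) fragment library (hypothetical; random in the toy example).
    """
    k, frag_len, _ = library.shape
    counts = np.zeros(k)
    for start in range(len(coords) - frag_len + 1):
        window = coords[start:start + frag_len]
        rmsds = [kabsch_rmsd(window, frag) for frag in library]
        counts[int(np.argmin(rmsds))] += 1
    return counts

def cosine_similarity(u, v):
    """Alignment-free comparison of two fingerprint vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy data only: random "library" and random-walk "backbones".
rng = np.random.default_rng(0)
toy_library = rng.normal(size=(4, 5, 3))
protein_a = rng.normal(size=(30, 3)).cumsum(axis=0)
protein_b = rng.normal(size=(45, 3)).cumsum(axis=0)
vec_a = bag_of_fragments(protein_a, toy_library)
vec_b = bag_of_fragments(protein_b, toy_library)
print(vec_a, vec_b, cosine_similarity(vec_a, vec_b))
```

Because each window only spans a few consecutive residues, a fingerprint of this kind cannot record which distant parts of the chain are in contact, which is the limitation noted above.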
Here we present a new approach to learning effective structural motif representations using deep learning. We develop DeepFold, a deep convolutional neural network model to extract structural motif features of a protein structure. We demonstrate that DeepFold substantially outperforms FragBag on protein structural search on a non-redundant protein structure database and a set of newly released structures. Remarkably, DeepFold not only extracts meaningful backbone segments but also finds important long-range interacting motifs for structural comparison. We expect that DeepFold will provide new insights into the evolution and hierarchical organization of protein structural motifs.
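As a rough illustration of how a convolutional model can turn structures of different sizes into fixed-length fingerprints, the sketch below applies a single convolutional layer with global max pooling. It is not the DeepFold architecture. The abstract does not state how structures are encoded, so the pairwise Cα distance matrix used here is an assumption, chosen because it exposes contacts between residues that are far apart in sequence. The filters are untrained random values and all names are hypothetical.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between residues, shape (N, N)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def conv_fingerprint(dist, filters):
    """One convolutional layer (valid cross-correlation), ReLU, global max pooling.

    dist:    (N, N) distance matrix (assumed input encoding).
    filters: (F, k, k) filter bank (random here; a real model would learn these).
    Returns a length-F vector regardless of N.
    """
    _, k, _ = filters.shape
    # All k-by-k patches of the matrix, shape (N-k+1, N-k+1, k, k).
    patches = np.lib.stride_tricks.sliding_window_view(dist, (k, k))
    # Response of every filter at every position, shape (N-k+1, N-k+1, F).
    responses = np.einsum("ijab,fab->ijf", patches, filters)
    responses = np.maximum(responses, 0.0)
    # Keep the strongest response of each filter anywhere in the matrix.
    return responses.max(axis=(0, 1))

# Toy data only: random filters and random-walk "backbones" of different lengths.
rng = np.random.default_rng(0)
toy_filters = rng.normal(size=(8, 3, 3))
short_chain = rng.normal(size=(25, 3)).cumsum(axis=0)
long_chain = rng.normal(size=(60, 3)).cumsum(axis=0)
fp_short = conv_fingerprint(distance_matrix(short_chain), toy_filters)
fp_long = conv_fingerprint(distance_matrix(long_chain), toy_filters)
print(fp_short.shape, fp_long.shape)
```

A filter positioned far from the diagonal of the distance matrix responds to residue pairs that are distant in sequence, so pooled features of this kind can in principle capture long-range patterns that a bag of contiguous fragments cannot.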