Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA.
BMC Bioinformatics. 2022 Jul 19;23(1):283. doi: 10.1186/s12859-022-04829-1.
Information about the domain architecture of proteins is useful for studying protein structure and function. However, accurate prediction of protein domain boundaries (i.e., the sequence regions separating two domains) from sequence remains a significant challenge. In this work, we develop a deep learning method based on multi-head U-Nets, called DistDom, that predicts protein domain boundaries from 1D sequence features and a predicted 2D inter-residue distance map. The 1D features contain the evolutionary and physicochemical information of protein sequences, whereas the 2D distance map carries structural information that has rarely been used in domain boundary prediction. The 1D and 2D features are processed by 1D and 2D U-Nets, respectively, to generate hidden features. Multi-head attention then uses these hidden features to predict the probability that each residue of a protein lies in a domain boundary, leveraging both local and global information in the features. The residue-level domain boundary predictions can also be used to classify proteins as single-domain or multi-domain. DistDom classifies the CASP14 single-domain and multi-domain targets with an accuracy of 75.9%, 13.28% more accurate than the state-of-the-art method. Tested on the CASP14 multi-domain protein targets with expert-annotated domain boundaries, DistDom achieves an average per-target F1 score of 0.263 for domain boundary prediction, 29.56% higher than the state-of-the-art method.
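To illustrate how residue-level boundary probabilities could be turned into a single-domain versus multi-domain call and scored with a per-target F1 measure, a minimal Python sketch follows. It is not the DistDom implementation. The function names, the 0.5 probability threshold, the peak-picking rule, and the residue tolerance window are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence


def predicted_boundaries(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Collapse each contiguous run of residues with probability >= threshold
    into one boundary position (the highest-probability residue in the run).
    The threshold and the peak-picking rule are illustrative assumptions."""
    boundaries: List[int] = []
    run: List[int] = []
    for i, p in enumerate(probs):
        if p >= threshold:
            run.append(i)
        elif run:
            boundaries.append(max(run, key=lambda j: probs[j]))
            run = []
    if run:  # close a run that reaches the end of the sequence
        boundaries.append(max(run, key=lambda j: probs[j]))
    return boundaries


def is_multi_domain(probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Call a protein multi-domain if at least one boundary is predicted."""
    return len(predicted_boundaries(probs, threshold)) > 0


def boundary_f1(predicted: Sequence[int], true: Sequence[int], tolerance: int = 8) -> float:
    """F1 for one target. A predicted boundary counts as correct if it lies
    within `tolerance` residues of a true boundary (hypothetical window)."""
    if not predicted or not true:
        return 0.0
    hits = sum(any(abs(p - t) <= tolerance for t in true) for p in predicted)
    recovered = sum(any(abs(p - t) <= tolerance for p in predicted) for t in true)
    precision = hits / len(predicted)
    recall = recovered / len(true)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy per-residue probabilities for a 29-residue sequence (made-up values).
toy_probs = [0.1] * 10 + [0.6, 0.9, 0.7] + [0.1] * 10 + [0.8] + [0.1] * 5
toy_pred = predicted_boundaries(toy_probs)
toy_f1 = boundary_f1(toy_pred, true=[12], tolerance=2)
```

Averaging `boundary_f1` over all multi-domain targets would give an average per-target F1 of the kind reported above, although the exact matching criterion used in the evaluation may differ from this sketch.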