Xu Y, Xu D, Gabow H N
Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830-6480, USA.
Bioinformatics. 2000 Dec;16(12):1091-104. doi: 10.1093/bioinformatics/16.12.1091.
Automatic decomposition of a multi-domain protein into individual domains represents a highly interesting and unsolved problem. As the number of protein structures in PDB is growing at an exponential rate, there is clearly a need for more reliable and efficient methods for protein domain decomposition simply to keep the domain databases up-to-date.
We present a new algorithm for solving the domain decomposition problem, using a graph-theoretic approach. We have formulated the problem as a network flow problem, in which each residue of a protein is represented as a node of the network and each residue--residue contact is represented as an edge with a particular capacity, depending on the type of the contact. A two-domain decomposition problem is solved by finding a bottleneck (or a minimum cut) of the network, which minimizes the total cross-edge capacity, using the classical Ford--Fulkerson algorithm. A multi-domain decomposition problem is solved through repeatedly solving a series of two-domain problems. The algorithm has been implemented as a computer program, called DomainParser. We have tested the program on a commonly used test set consisting of 55 proteins. The decomposition results are 78.2% in agreement with the literature on both the number of decomposed domains and the assignments of residues to each domain, which compares favorably to existing programs. On the subset of two-domain proteins (20 in number), the program assigned 96.7% of the residues correctly when we require that the number of decomposed domains is two.
将多结构域蛋白质自动分解为单个结构域是一个极具吸引力但尚未解决的问题。由于蛋白质数据银行(PDB)中蛋白质结构的数量呈指数级增长,显然需要更可靠、高效的蛋白质结构域分解方法,以使结构域数据库保持最新状态。
我们提出了一种新的算法来解决结构域分解问题,该算法采用图论方法。我们将该问题表述为一个网络流问题,其中蛋白质的每个残基表示为网络的一个节点,每个残基 - 残基接触表示为一条具有特定容量的边,具体容量取决于接触的类型。通过使用经典的福特 - 富尔克森算法找到网络的瓶颈(或最小割)来解决双结构域分解问题,该瓶颈可使总交叉边容量最小化。多结构域分解问题通过反复解决一系列双结构域问题来解决。该算法已实现为一个名为DomainParser的计算机程序。我们在一个由55种蛋白质组成的常用测试集上对该程序进行了测试。在分解结构域的数量以及每个结构域的残基分配方面,分解结果与文献的一致性为78.2%,这比现有程序表现更优。在双结构域蛋白质子集(共20个)上,当我们要求分解的结构域数量为两个时,该程序正确分配了96.7%的残基。