Huan J, Wang W, Washington A, Prins J, Shah R, Tropsha A
Department of Computer Science, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA.
Pac Symp Biocomput. 2004:411-22. doi: 10.1142/9789812704856_0039.
Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs where the nodes are residues and the edges connect residues found within certain distance from each other. Application of subgraph mining to proteins is challenging for a number reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification. To address these challenges, we have developed an information theoretic model called coherent subgraph mining. From information theory, the entropy of a random variable X measures the information content carried by X and the Mutual Information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that only reports coherent subgraphs. To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al, 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families.
蛋白质结构注释与分类是生物信息学中的一个重要问题。我们报告了一种高效子图挖掘技术的开发及其在寻找蛋白质结构家族内特征子结构模式中的应用。在我们的方法中,蛋白质结构由图表示,其中节点是残基,边连接相互之间在一定距离内发现的残基。将子图挖掘应用于蛋白质存在诸多挑战:(1)蛋白质图庞大且复杂;(2)当前蛋白质数据库规模巨大且持续快速增长;(3)在所有可能子图的巨大集合中,只有一小部分频繁子图在蛋白质分类背景下可能具有重要意义。为应对这些挑战,我们开发了一种名为相干子图挖掘的信息论模型。从信息论角度来看,随机变量X的熵衡量X所携带的信息内容,两个随机变量X和Y之间的互信息(MI)衡量X和Y之间的相关性。如果子图X与嵌入其中的每个足够大的子子图Y高度相关,我们就将子图X定义为相干的。基于MI度量,我们设计了一种仅报告相干子图的搜索方案。为了确定相干蛋白质子图的重要性,我们进行了一项实验研究,在SCOP数据库(Murzin等人,1995年)中注释的几个蛋白质结构家族中识别出所有相干子图。使用支持向量机算法在二元分类方案下对来自不同家族的蛋白质进行分类。我们发现这种方法能够识别各个SCOP家族特有的空间基序,并在家族之间提供出色的区分能力。