Clayton School of Information Technology, Monash University, Clayton, VIC 3800, Australia.
Bioinformatics. 2011 Jul 1;27(13):i43-51. doi: 10.1093/bioinformatics/btr240.
Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizing, comparing, classifying, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features (that is, helices and strands of sheet) by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification depend crucially on the consistency and accuracy of external methods that assign secondary structures to protein coordinate data. Although many methods exist to identify secondary structure automatically, the imprecision of the definitions, along with errors and inconsistencies in experimental structure data, drastically limits their applicability for generating reliable simplified representations, especially for structural comparison. This article introduces a mathematically rigorous algorithm to delineate protein structure using the statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the hydrogen-bonding patterns and local substructural geometry that current methods inspect. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to existing programs, which are often beset by them. Analysis of the results over a large number of proteins suggests that the method produces a consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure.
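To make the idea of a piecewise linear explanation chosen by message length concrete, here is a minimal sketch of a two-part cost minimized by dynamic programming over an ordered list of 3D points. It is not the algorithm of the article: the names `delineate` and `segment_bits` are hypothetical, and `sigma` and `seg_bits` are illustrative assumptions (a fixed noise scale and a fixed cost per stated segment), whereas the article's method is described as free of parameters and thresholds. The data term is a Gaussian negative log-likelihood with constant terms dropped.

```python
import math


def point_to_line_sq(p, a, b):
    """Squared distance from point p to the infinite line through a and b."""
    ab = [b[k] - a[k] for k in range(3)]
    ap = [p[k] - a[k] for k in range(3)]
    ab2 = sum(x * x for x in ab)
    if ab2 == 0.0:
        # Degenerate segment: fall back to the distance from a.
        return sum(x * x for x in ap)
    t = sum(ap[k] * ab[k] for k in range(3)) / ab2
    return sum((ap[k] - t * ab[k]) ** 2 for k in range(3))


def segment_bits(points, i, j, sigma, seg_bits):
    """Toy two-part cost, in bits, of explaining points[i..j] by one segment.

    First part: a fixed cost (seg_bits, an assumption) to state the segment.
    Second part: cost of the interior points' deviations from the line,
    taken as a Gaussian negative log-likelihood without its constants.
    """
    a, b = points[i], points[j]
    sq = sum(point_to_line_sq(points[k], a, b) for k in range(i + 1, j))
    return seg_bits + sq / (2.0 * sigma * sigma * math.log(2.0))


def delineate(points, sigma=0.5, seg_bits=10.0):
    """Return (breakpoint indices, total bits) minimizing the toy cost."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    best = [math.inf] * n  # best[j]: cheapest cost of explaining points[0..j]
    prev = [-1] * n        # prev[j]: start index of the last segment
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            cost = best[i] + segment_bits(points, i, j, sigma, seg_bits)
            if cost < best[j]:
                best[j] = cost
                prev[j] = i
    # Walk back from the last point to recover the breakpoints.
    cuts = [n - 1]
    while cuts[-1] != 0:
        cuts.append(prev[cuts[-1]])
    return cuts[::-1], best[-1]
```

For an L-shaped trace such as `[(0,0,0), (1,0,0), (2,0,0), (3,0,0), (3,1,0), (3,2,0), (3,3,0)]`, `delineate` places its only interior breakpoint at the corner (index 3): stating a second segment costs fewer bits than encoding the large deviations from a single line. The quadratic-time search is used here only because it is easy to verify; it says nothing about how the published method performs its search.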