Molécules thérapeutiques in silico (MTi), INSERM UMR-S973, University Paris Diderot, Paris 7, France.
Probability Statistique and Biology (PSB), LPMA laboratory, CNRS INSMI UMR 7599, University Pierre et Marie Curie, Paris 6, France.
PLoS One. 2018 Jul 5;13(7):e0198854. doi: 10.1371/journal.pone.0198854. eCollection 2018.
In this paper, we describe SAFlex (Structural Alphabet Flexibility), an extension of an existing structural alphabet (HMM-SA) designed to better exploit the growing body of protein three-dimensional structure information by encoding protein conformations even when residues are missing or uncertain. A structural alphabet (SA) reduces the complexity of three-dimensional protein conformations, and of their analysis and comparison, by simplifying any conformation into a series of structural letters. Our methodology presents several novelties. First, it accounts for encoding uncertainty by providing several encoding options: the maximum a posteriori encoding, the marginal posterior distribution, and the effective number of letters at each position. Second, our new algorithm handles missing data in protein structure files (which concern more than 75% of the proteins in the Protein Data Bank) within a rigorous probabilistic framework. Third, SAFlex can encode different replicates of a single protein, such as several chains of a homomer, and build a consensus encoding from them. This makes it possible to localize structural differences between chains and to detect structural variability, which is essential for identifying protein flexibility. These improvements are illustrated on different proteins, such as the crystal structure of a eukaryotic small heat shock protein. They are promising for exploring the growing redundancy in protein structure data and for obtaining a useful quantification of protein flexibility.
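The three encoding options named in the abstract can be illustrated on a toy hidden Markov model. The sketch below is not the SAFlex implementation: it assumes a discrete HMM with three hypothetical letters and three hypothetical observation symbols (HMM-SA itself is a different, larger model), it treats a missing position as an observation whose likelihood is 1 for every letter, and it takes the effective number of letters to be the exponential of the Shannon entropy of the marginal posterior. All parameter values and function names are made up for illustration.

```python
import math

def _lik(emit, obs, t, s):
    # A missing observation (None) is marginalized out: likelihood 1 for every letter.
    return 1.0 if obs[t] is None else emit[s][obs[t]]

def marginal_posteriors(init, trans, emit, obs):
    """Forward-backward: P(letter at t | all observations) for each position t."""
    n, T = len(init), len(obs)
    alpha, prev = [], None
    for t in range(T):
        if t == 0:
            row = [init[s] * _lik(emit, obs, 0, s) for s in range(n)]
        else:
            row = [_lik(emit, obs, t, s) * sum(prev[r] * trans[r][s] for r in range(n))
                   for s in range(n)]
        z = sum(row)
        prev = [x / z for x in row]  # rescale to avoid underflow
        alpha.append(prev)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        row = [sum(trans[s][r] * _lik(emit, obs, t + 1, r) * beta[t + 1][r] for r in range(n))
               for s in range(n)]
        z = sum(row)
        beta[t] = [x / z for x in row]
    post = []
    for t in range(T):
        row = [alpha[t][s] * beta[t][s] for s in range(n)]
        z = sum(row)
        post.append([x / z for x in row])
    return post

def map_path(init, trans, emit, obs):
    """Viterbi: most probable letter sequence (assumes strictly positive parameters)."""
    n, T = len(init), len(obs)
    score = [math.log(init[s]) + math.log(_lik(emit, obs, 0, s)) for s in range(n)]
    back = []
    for t in range(1, T):
        new_score, ptr = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: score[r] + math.log(trans[r][s]))
            ptr.append(best)
            new_score.append(score[best] + math.log(trans[best][s])
                             + math.log(_lik(emit, obs, t, s)))
        score = new_score
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace back from the last position
        state = ptr[state]
        path.append(state)
    return path[::-1]

def effective_letters(post):
    """exp(Shannon entropy) of each marginal posterior (assumed definition)."""
    return [math.exp(-sum(p * math.log(p) for p in row if p > 0)) for row in post]

# Hypothetical toy model: 3 letters, 3 observation symbols, one missing position.
init = [0.5, 0.3, 0.2]
trans = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
emit = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
obs = [0, 0, None, 1, 1]

post = marginal_posteriors(init, trans, emit, obs)
neff = effective_letters(post)
path = map_path(init, trans, emit, obs)
for t, (row, k) in enumerate(zip(post, neff)):
    print(t, path[t], [round(p, 3) for p in row], round(k, 2))
```

In this toy setting the missing position is still encoded, because its posterior is informed by its neighbours through the transition probabilities, and its effective number of letters is higher than at the well-observed positions, which is the kind of per-position uncertainty measure the abstract describes.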