Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, Yokohama, Kanagawa, Japan.
Proteins. 2020 Oct;88(10):1271-1284. doi: 10.1002/prot.25900. Epub 2020 May 28.
The vanishingly small fraction of sequence space explored over millions of years of evolution suggests that natural proteins are constrained by functional prerequisites and should differ from randomly generated sequences. We have developed a protein sequence fitness scoring function that uses sequence and the corresponding secondary structure information at the tripeptide level to differentiate natural from nonnatural proteins. The proposed fitness function is extensively validated on a dataset of about 210 000 natural and nonnatural protein sequences and benchmarked against existing methods for differentiating natural and nonnatural proteins. The high sensitivity, specificity, and accuracy (0.81, 0.95, and 91%, respectively) of the fitness function demonstrate its potential for sampling protein sequences with a higher probability of mimicking natural proteins. Moreover, the four major classes of proteins (α proteins, β proteins, α/β proteins, and α + β proteins) are analyzed separately, and β proteins are found to score slightly lower than the other classes. Further, an analysis of about 250 designed proteins (taken from previously reported cases) helped to define the boundaries for sampling ideal protein sequences. Protein sequence characterization aided by the proposed fitness function could facilitate the exploration of new perspectives in the design of novel functional proteins.
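To make the idea of a tripeptide-level score and the reported evaluation metrics concrete, the following is a minimal Python sketch. It is not the published fitness function: the log-odds form, the crude additive smoothing, and the names `tripeptide_counts`, `log_odds_score`, and `classification_metrics` are all assumptions introduced here for illustration. Only the general ingredients (sequence plus secondary structure at the tripeptide level, and sensitivity, specificity, and accuracy as measures) come from the abstract.

```python
from collections import Counter
from math import log


def tripeptide_counts(seq, ss):
    """Count overlapping (residue triplet, secondary-structure triplet) pairs.

    `seq` is an amino-acid string and `ss` a per-residue secondary-structure
    string of the same length (for example H/E/C labels).
    """
    if len(seq) != len(ss):
        raise ValueError("sequence and secondary-structure strings must match in length")
    return Counter((seq[i:i + 3], ss[i:i + 3]) for i in range(len(seq) - 2))


def log_odds_score(seq, ss, natural, background, pseudo=1.0):
    """Mean log-odds of a sequence's tripeptide features.

    Hypothetical scoring form (an assumption, not the paper's formula):
    each feature contributes log(p_natural / p_background), where both
    probabilities come from reference counts with crude additive smoothing.
    """
    n_nat = sum(natural.values())
    n_bg = sum(background.values())
    feats = tripeptide_counts(seq, ss)
    total = sum(feats.values())
    if total == 0:
        return 0.0  # sequences shorter than three residues carry no features
    score = 0.0
    for key, count in feats.items():
        p_nat = (natural[key] + pseudo) / (n_nat + pseudo)
        p_bg = (background[key] + pseudo) / (n_bg + pseudo)
        score += count * log(p_nat / p_bg)
    return score / total


def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts.

    Here "positive" means a sequence labeled natural.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Toy reference sets, invented purely for illustration.
    natural = tripeptide_counts("ACDEACDE", "HHHHHHHH")
    background = tripeptide_counts("WWWWWWWW", "CCCCCCCC")
    print(log_odds_score("ACDE", "HHHH", natural, background))
    # Illustrative confusion counts, not data from the study.
    print(classification_metrics(tp=81, fn=19, tn=95, fp=5))
```

Under these assumptions, a positive score means the sequence's tripeptide features are more frequent in the natural reference set than in the background set. Because accuracy is a weighted average of sensitivity and specificity, it always falls between the two, with the exact value depending on how many natural and nonnatural sequences are in the test set.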