Wang LiQiang, Li CuiFeng
Department of Biochemistry and Molecular Biology, College of Life Science, Nankai University, Weijin Road 94, Tianjin, 300071, China
Biotechnol Lett. 2014 Oct;36(10):1963-9. doi: 10.1007/s10529-014-1577-3. Epub 2014 Jun 15.
A genetic algorithm (GA) coupled with multiple linear regression (MLR) was used to extract useful features from amino acids and g-gap dipeptides for distinguishing between thermophilic and non-thermophilic proteins. The method was trained on a benchmark dataset of 915 thermophilic and 793 non-thermophilic proteins. It reached an overall accuracy of 95.4% in a jackknife test using nine amino acids, 38 0-gap dipeptides and 29 1-gap dipeptides. Accuracy as a function of protein size ranged between 85.8% and 96.9%. The overall accuracies of three independent tests were 93%, 93.4% and 91.8%. These results suggest that the GA-MLR approach described herein is a powerful method for selecting features that describe thermostability mechanisms and may aid in the design of more stable proteins.
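For readers unfamiliar with the feature type, a g-gap dipeptide is a pair of residues separated by g intervening positions, so a 0-gap dipeptide is an adjacent pair and a 1-gap dipeptide skips one residue. The following is a minimal Python sketch of how such a composition vector could be computed. The function name `g_gap_dipeptide_composition` is hypothetical, and normalizing counts by the number of valid pairs is an assumption; the abstract does not state the exact definition the authors used.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence, gap=0):
    """Return the frequency of each residue pair separated by `gap` residues.

    Assumption: frequencies are counts divided by the number of valid pairs;
    pairs containing non-standard residues are skipped.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    # Pair residue i with residue i + gap + 1.
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


# Toy sequence, not taken from the benchmark dataset.
zero_gap = g_gap_dipeptide_composition("ACAC", gap=0)
one_gap = g_gap_dipeptide_composition("ACAC", gap=1)
```

Each call yields a 400-dimensional vector (20 x 20 pairs) for a given gap. A feature selection step such as the GA-MLR procedure described above would then pick a small subset of these dimensions.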