Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, Koganei-shi, Tokyo 184-8588, Japan.
Bioinformatics. 2011 Feb 15;27(4):487-94. doi: 10.1093/bioinformatics/btq700. Epub 2010 Dec 17.
Biologically important proteins are often large, multidomain proteins, which are difficult to characterize by high-throughput experimental methods. Efficient domain/boundary predictions are thus increasingly required in diverse areas of proteomics research for computationally dissecting proteins into readily analyzable domains.
We constructed a support vector machine (SVM)-based domain linker predictor, DROP (Domain linker pRediction using OPtimal features), which was trained with 25 optimal features. The optimal combination of features was identified from a set of 3000 features using a random forest algorithm complemented with a stepwise feature selection. DROP demonstrated a prediction sensitivity and precision of 41.3 and 49.4%, respectively. These values were over 19.9% higher than those of control SVM predictors trained with non-optimized features, strongly suggesting the effectiveness of our feature selection method. In addition, the mean NDO-Score of DROP for predicting novel domains in seven CASP8 FM multidomain proteins was 0.760, which was higher than that of any of the 12 published CASP8 DP servers. Overall, these results indicate that the SVM prediction of domain linkers can be improved by identifying optimal features that best distinguish linker from non-linker regions.
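As an illustration of the two metrics reported above, the following minimal sketch computes sensitivity, TP/(TP+FN), and precision, TP/(TP+FP), from binary linker/non-linker labels. It assumes a per-residue labelling (1 = linker, 0 = non-linker), which may differ from the evaluation unit actually used for DROP; the function name `sensitivity_precision` and the toy data are hypothetical and not taken from the paper.

```python
def sensitivity_precision(true_labels, predicted_labels):
    """Return (sensitivity, precision) for binary labels.

    Assumption: 1 marks a linker residue and 0 a non-linker residue.
    This per-residue scheme is illustrative and may not match the
    evaluation protocol used in the paper.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # linker predicted as linker
    fn = sum(1 for t, p in pairs if t and not p)    # linker that was missed
    fp = sum(1 for t, p in pairs if not t and p)    # non-linker predicted as linker

    # Guard against empty denominators (no true or no predicted linkers).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Toy data (hypothetical): 5 true linker residues, 4 predicted linker residues.
true_labels = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]

sens, prec = sensitivity_precision(true_labels, predicted_labels)
print(f"sensitivity={sens:.3f} precision={prec:.3f}")
```

On this toy input, three of the five true linker residues are recovered and three of the four predicted residues are correct, so the two metrics answer different questions: how many linkers were found, and how many predictions can be trusted.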