School of Computer Science and Technology, Tianjin University, Tianjin, 30050, China.
State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, 300074, China.
BMC Genomics. 2017 Oct 16;18(Suppl 7):742. doi: 10.1186/s12864-017-4128-1.
Cell-penetrating peptides (CPPs) are short peptides (5-30 amino acids) that can enter almost any cell without causing significant damage. Because of their high delivery efficiency, CPPs are promising candidates for gene therapy and cancer treatment. Techniques that correctly predict CPPs are therefore expected to accelerate CPP applications in future therapeutics. Computational methods have recently been reported to predict CPPs successfully. However, the predictive performance of existing methods is neither satisfactory nor reliable enough to identify CPPs accurately.
In this study, we propose a novel computational predictor called SkipCPP-Pred to further improve predictive performance. Its novelty lies in a sequence-based feature representation algorithm called adaptive k-skip-n-gram, which captures the intrinsic correlation information between residues. By combining the proposed adaptive skip features with a random forest (RF) classifier, we construct the SkipCPP-Pred prediction model. Jackknife test results show that SkipCPP-Pred achieves 3.6% higher accuracy than state-of-the-art CPP predictors. Moreover, we construct a high-quality benchmark dataset by reducing data redundancy and enhancing the similarity between the positive and negative classes. Building prediction models on this dataset avoids the performance bias present in existing methods and yields a promising predictive model.
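The idea behind skip-gram features can be illustrated with a minimal sketch for n = 2. This is not the authors' implementation: the function name `skip_2gram_features`, the `max_skip` parameter, and the reading of "adaptive" as capping the skip distance by the peptide length are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature index.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def skip_2gram_features(seq, max_skip=None):
    """Return a 400-dimensional vector of normalized pair frequencies.

    A pair (a, b) is counted whenever residue b follows residue a with
    0..k residues skipped in between.
    """
    seq = seq.upper()
    # Assumption: "adaptive" is taken to mean that k is capped by the
    # peptide length, so short peptides never request impossible skips.
    longest_skip = len(seq) - 2
    k = longest_skip if max_skip is None else min(max_skip, longest_skip)

    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for skip in range(k + 1):
        step = skip + 1  # distance between the two residues of a pair
        for i in range(len(seq) - step):
            pair = seq[i] + seq[i + step]
            if pair in counts:  # ignore non-standard residue symbols
                counts[pair] += 1
                total += 1

    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]
```

Vectors of this kind could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`, and evaluated with a jackknife (leave-one-out) test. Whether this matches the exact configuration used for SkipCPP-Pred is not stated in the abstract.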
SkipCPP-Pred is a simple and fast sequence-based predictor that uses the adaptive k-skip-n-gram model to improve the prediction of CPPs. SkipCPP-Pred is publicly available as an online webserver (http://server.malab.cn/SkipCPP-Pred/Index.html).