School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen, 518172, Guangdong, People's Republic of China.
Warshel Institute for Computational Biology, The Chinese University of Hong Kong (Shenzhen), Shenzhen, 518172, Guangdong, People's Republic of China.
Sci Rep. 2020 Nov 24;10(1):20447. doi: 10.1038/s41598-020-77173-0.
Lysine crotonylation (Kcr) is a protein post-translational modification (PTM) that plays important roles in a variety of cellular regulatory processes. Several methods have been proposed for identifying crotonylation sites. However, most of them predict well on only one of histone or non-histone proteins. This work therefore aims for a more balanced performance across species, covering plant (non-histone) and mammalian (histone) proteins. Support vector machine (SVM) and random forest (RF) classifiers were employed. In cross-validation, the RF classifier based on the EGAAC feature achieved the best predictive performance: it was competitive with existing methods and more robust on imbalanced datasets. An independent test was also carried out to compare this study with existing methods that use the same features or the same classifier. The SVM and RF classifiers achieved their best performances with 92% sensitivity, 88% specificity, 90% accuracy, and an MCC of 0.80 on the relatively small mammalian dataset, and 77% sensitivity, 83% specificity, 70% accuracy, and an MCC of 0.54 on the large-scale plant dataset, respectively. Finally, cross-species independent testing was carried out, which demonstrated the differences between plant and mammalian species.
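The four metrics reported above (sensitivity, specificity, accuracy, and the Matthews correlation coefficient, MCC) all follow from a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name `binary_metrics` and the example counts are hypothetical, chosen only for illustration, and are not taken from the study's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    Hypothetical helper for illustration; not the authors' code.
    """
    def ratio(num, den):
        # Return 0.0 when a denominator is zero instead of raising.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)            # true positive rate
    specificity = ratio(tn, tn + fp)            # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Assumed, made-up counts for a balanced test set of 100 positive and 100 negative sites.
metrics = binary_metrics(tp=92, fp=12, tn=88, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On a balanced test set, accuracy is the mean of sensitivity and specificity. MCC also accounts for all four cells of the confusion matrix, which is why it is commonly reported alongside accuracy for imbalanced datasets.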