Bioinformatics Lab., Department of Statistics, Rajshahi University, Rajshahi, 6205, Bangladesh.
Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
Comput Biol Chem. 2020 Apr;85:107238. doi: 10.1016/j.compbiolchem.2020.107238. Epub 2020 Feb 19.
Among protein post-translational modifications (PTMs), ubiquitination is considered one of the most significant: it regulates cellular functions and is implicated in various diseases. Identifying ubiquitination sites is therefore important for understanding the mechanisms of ubiquitination-related biological processes. Both experimental and computational approaches are available for identifying ubiquitination sites from the protein sequences of different species. Experimental approaches are time-consuming, laborious, and costly; in silico prediction is a faster, easier, and more cost-effective alternative. Moreover, the sequence patterns around ubiquitination sites differ between species, which calls for species-specific predictors. In this study, we therefore propose a novel computational method for identifying ubiquitination sites from protein sequences of Arabidopsis thaliana (A. thaliana) that is also robust against outlying observations. In a comparative study of two encoding schemes and three classifiers, the random forest (RF) based predictor using the CKSAAP (composition of k-spaced amino acid pairs) encoding scheme, with a 1:1 ratio of positive to negative samples (i.e. ubiquitinated and non-ubiquitinated sites) in the training dataset, was selected as the best predictor. The proposed predictor achieved an area under the ROC curve (AUC) of 0.91 in 5-fold cross-validation on the training dataset and 0.86 on the independent test dataset of A. thaliana. The proposed RF-based predictor also performed much better than other existing ubiquitination site predictors for A. thaliana.
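To make the CKSAAP encoding named above concrete, the following is a minimal sketch of how a sequence window around a candidate lysine could be turned into a feature vector. It follows the general definition of CKSAAP (normalized counts of ordered residue pairs separated by k positions), not necessarily the authors' exact implementation. The abstract does not state the window length, the maximum gap, or how padding symbols are handled, so the `k_max` default, the example window, and the choice to skip pairs containing non-standard symbols are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard residues, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(window, k_max=2):
    """Return the CKSAAP feature vector of a sequence window.

    For each gap k = 0..k_max, count every ordered residue pair whose
    members are separated by exactly k positions, then divide by the
    number of pair positions in the window (len(window) - k - 1).
    Pairs containing a non-standard symbol (e.g. a padding '-') are
    skipped; this handling is an assumption, not taken from the paper.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_positions = len(window) - k - 1
        for i in range(max(n_positions, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        denom = n_positions if n_positions > 0 else 1
        features.extend(counts[p] / denom for p in PAIRS)
    return features


# Hypothetical 7-residue window, chosen only to keep the example small.
vec = cksaap("MKAKLKA", k_max=1)
print(len(vec))  # 400 features per gap value, two gap values
```

Each window yields 400 × (`k_max` + 1) features, which could then be passed to a classifier such as a random forest. Training on a balanced (1:1) sample of positive and negative windows, as the abstract describes, would be a separate step not shown here.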