Zhang Shanxin, Han Jiuqiang, Liu Jun, Zheng Jiguang, Liu Ruiling
School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China.
Comput Biol Chem. 2015 Feb;54:49-56. doi: 10.1016/j.compbiolchem.2014.12.001. Epub 2014 Dec 30.
Polyadenylation is the addition of a poly(A) tail to the 3' end of an mRNA. Identifying the motifs that control polyadenylation is essential for improving the accuracy of genome annotation and for better understanding the mechanisms that govern gene regulation. Bioinformatics methods for poly(A) motif recognition have shown that information extracted from the sequences surrounding candidate motifs can effectively differentiate true motifs from false ones. However, these methods depend on either domain features or string kernels. To date, no method that combines information from different sources has been reported. Here, we propose an improved poly(A) motif recognition method that combines different sources through decision-level fusion. First, we propose two novel prediction methods based on the support vector machine (SVM): one uses domain-specific features together with principal component analysis (PCA) to eliminate redundancy (PCA-SVM); the other is based on the Oligo string kernel (Oligo-SVM). We then propose a novel machine-learning method for poly(A) motif prediction that combines four poly(A) motif recognition methods: two state-of-the-art methods (Random Forest (RF) and HMM-SVM) and the two newly proposed methods (PCA-SVM and Oligo-SVM). A decision-level information fusion method combines the decision values of the different classifiers by applying Dempster-Shafer (DS) evidence theory. We evaluated our method on a comprehensive poly(A) dataset consisting of 14,740 samples covering 12 variants of poly(A) motifs and 2,750 samples containing none of these motifs. Our method achieved an accuracy of up to 86.13%. Compared with the four individual classifiers, our evidence-theory-based method reduces the average error rate by about 30%, 27%, 26% and 16%, respectively. These experimental results suggest that the proposed method is more effective for poly(A) motif recognition.
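To illustrate the decision-level fusion step, the following is a minimal sketch of Dempster's rule of combination on a two-class frame (true motif vs. false motif). The abstract does not say how classifier decision values are converted into evidence, so the `to_mass` mapping, the `reliability` discount, and the four example scores are assumptions made purely for illustration, not the authors' procedure.

```python
from functools import reduce


def to_mass(p_true, reliability):
    """Map a probability-like score to a basic mass assignment.

    Hypothetical mapping (not from the abstract): a fraction `reliability`
    of the belief is split between T (true motif) and F (false motif);
    the remainder goes to U, the whole frame {T, F}, i.e. "don't know".
    """
    return {
        "T": reliability * p_true,
        "F": reliability * (1.0 - p_true),
        "U": 1.0 - reliability,
    }


def dempster(m1, m2):
    """Combine two mass assignments with Dempster's rule."""
    # Conflict: mass assigned to contradictory pairs (T with F).
    conflict = m1["T"] * m2["F"] + m1["F"] * m2["T"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    return {
        # T survives when both say T, or one says T and the other is unsure.
        "T": (m1["T"] * m2["T"] + m1["T"] * m2["U"] + m1["U"] * m2["T"]) / norm,
        "F": (m1["F"] * m2["F"] + m1["F"] * m2["U"] + m1["U"] * m2["F"]) / norm,
        "U": (m1["U"] * m2["U"]) / norm,
    }


def fuse(masses):
    """Fold Dempster's rule over any number of sources."""
    return reduce(dempster, masses)


# Made-up scores for one candidate motif from the four classifiers.
scores = {"RF": 0.7, "HMM-SVM": 0.6, "PCA-SVM": 0.8, "Oligo-SVM": 0.4}
masses = [to_mass(p, reliability=0.8) for p in scores.values()]
fused = fuse(masses)
label = "true motif" if fused["T"] > fused["F"] else "false motif"
print(fused, label)
```

In this sketch, three classifiers that lean the same way outweigh one dissenting classifier, and the mass left on `U` shrinks as more sources are combined. A per-classifier `reliability` (for example, derived from validation accuracy) would be one way to weight the sources differently, though that too is an assumption.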