Anwar Firoz, Baker Syed Murtuza, Jabid Taskeed, Mehedi Hasan Md, Shoyaib Mohammad, Khan Haseena, Walshe Ray
Department of Computer Science and Engineering, East West University, Bangladesh.
BMC Bioinformatics. 2008 Oct 4;9:414. doi: 10.1186/1471-2105-9-414.
Eukaryotic promoter prediction using computational analysis techniques is one of the most difficult jobs in computational genomics that is essential for constructing and understanding genetic regulatory networks. The increased availability of sequence data for various eukaryotic organisms in recent years has necessitated for better tools and techniques for the prediction and analysis of promoters in eukaryotic sequences. Many promoter prediction methods and tools have been developed to date but they have yet to provide acceptable predictive performance. One obvious criteria to improve on current methods is to devise a better system for selecting appropriate features of promoters that distinguish them from non-promoters. Secondly improved performance can be achieved by enhancing the predictive ability of the machine learning algorithms used.
In this paper, a novel approach is presented in which 128 4-mer motifs in conjunction with a non-linear machine-learning algorithm utilising a Support Vector Machine (SVM) are used to distinguish between promoter and non-promoter DNA sequences. By applying this approach to plant, Drosophila, human, mouse and rat sequences, the classification model has showed 7-fold cross-validation percentage accuracies of 83.81%, 94.82%, 91.25%, 90.77% and 82.35% respectively. The high sensitivity and specificity value of 0.86 and 0.90 for plant; 0.96 and 0.92 for Drosophila; 0.88 and 0.92 for human; 0.78 and 0.84 for mouse and 0.82 and 0.80 for rat demonstrate that this technique is less prone to false positive results and exhibits better performance than many other tools. Moreover, this model successfully identifies location of promoter using TATA weight matrix.
The high sensitivity and specificity indicate that 4-mer frequencies in conjunction with supervised machine-learning methods can be beneficial in the identification of RNA pol II promoters comparative to other methods. This approach can be extended to identify promoters in sequences for other eukaryotic genomes.
利用计算分析技术进行真核生物启动子预测是计算基因组学中最困难的任务之一,对于构建和理解基因调控网络至关重要。近年来,各种真核生物序列数据的可用性不断增加,因此需要更好的工具和技术来预测和分析真核生物序列中的启动子。到目前为止,已经开发了许多启动子预测方法和工具,但它们尚未提供可接受的预测性能。改进当前方法的一个明显标准是设计一个更好的系统来选择启动子的适当特征,以将它们与非启动子区分开来。其次,可以通过提高所使用的机器学习算法的预测能力来实现性能提升。
本文提出了一种新方法,其中结合使用128个四联体基序和利用支持向量机(SVM)的非线性机器学习算法来区分启动子和非启动子DNA序列。通过将这种方法应用于植物、果蝇、人类、小鼠和大鼠序列,分类模型分别显示出7折交叉验证准确率为83.81%、94.82%、91.25%、90.77%和82.35%。植物的高灵敏度和特异性值分别为0.86和0.90;果蝇为0.96和0.92;人类为0.88和0.92;小鼠为0.78和0.84;大鼠为0.82和0.80,这表明该技术不太容易产生假阳性结果,并且比许多其他工具表现更好。此外,该模型使用TATA权重矩阵成功识别了启动子的位置。
高灵敏度和特异性表明,与其他方法相比,四联体频率结合监督机器学习方法有助于识别RNA聚合酶II启动子。这种方法可以扩展到识别其他真核生物基因组序列中的启动子。