Ji Guoli, Wu Xiaohui, Shen Yingjia, Huang Jiangyin, Quinn Li Qingshun
Department of Automation, Xiamen University, Xiamen 361000, China.
J Theor Biol. 2010 Aug 7;265(3):287-96. doi: 10.1016/j.jtbi.2010.05.015. Epub 2010 May 26.
Messenger RNA polyadenylation is one of the essential processing steps during eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which is also the end of a gene. A computational program able to recognize poly(A) sites would be useful both for genome annotation, in finding gene ends, and for predicting alternative poly(A) sites. Features that define poly(A) sites can now be extracted from poly(A) site datasets to build such predictive models. Using methods including K-gram pattern, Z-curve, position-specific scoring matrix and first-order inhomogeneous Markov sub-model, numerous features were generated and placed in an original feature space. To select the most useful features, attribute selection algorithms, such as information gain and entropy, were employed. A training model based on a Bayesian network was then built to determine a subset of optimal features. Test models corresponding to the training models were built to predict poly(A) sites in Arabidopsis and rice. The resulting prediction model is termed Poly(A) site classifier, or PAC. The model is distinctive in its structure: each sub-model can be replaced or expanded, and feature generation, selection and classification are all independent processes. This modular design makes it easily adaptable to different species or datasets. The algorithm's high specificity and sensitivity were demonstrated by testing several datasets; at the best combinations, both reached 95%. The software package may be used for genome annotation and for optimizing transgene structure.
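As a rough illustration of the feature generation and selection steps named above, the following Python sketch computes K-gram counts and Z-curve coordinates for a sequence and scores a discrete feature by information gain. It is not the PAC implementation: the function names (`kgram_counts`, `z_curve`, `information_gain`) are hypothetical, the Z-curve uses the standard cumulative definition, and the input is assumed to be an uppercase DNA string containing only A, C, G and T.

```python
from collections import Counter
from math import log2


def kgram_counts(seq, k):
    """Count every overlapping k-gram (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_curve(seq):
    """Return cumulative Z-curve points (x_n, y_n, z_n), one per position.

    Assumes seq contains only the uppercase letters A, C, G and T.
    x separates purines (A, G) from pyrimidines (C, T),
    y separates amino (A, C) from keto (G, T) bases,
    z separates weak (A, T) from strong (C, G) hydrogen bonding.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq:
        counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

In a pipeline of this kind, each candidate feature (for example, the presence of a given K-gram in a window around a site) would be scored with `information_gain` against poly(A) versus non-poly(A) labels, and only the highest-scoring features passed to the classifier.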