Cummings Michael P, Myers Daniel S
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742-3360, USA.
BMC Bioinformatics. 2004 Sep 16;5:132. doi: 10.1186/1471-2105-5-132.
RNA editing is the process whereby an RNA sequence is modified from the sequence of the corresponding DNA template. In the mitochondria of land plants, some cytidines are converted to uridines before translation. Despite substantial study, the molecular biological mechanism by which C-to-U RNA editing proceeds remains relatively obscure, although several experimental studies have implicated a role for cis-recognition. A highly non-random distribution of nucleotides is observed in the immediate vicinity of edited sites (within 20 nucleotides 5' and 3'), but no precise consensus motif has been identified.
Data for analysis were derived from the complete mitochondrial genomes of Arabidopsis thaliana, Brassica napus, and Oryza sativa; additionally, a combined data set of observations across all three genomes was generated. We selected data sets based on the 20 nucleotides 5' and the 20 nucleotides 3' of edited sites, together with an equivalently sized and appropriately constructed null set of non-edited sites. We used tree-based statistical methods and random forests to generate models of C-to-U RNA editing based on the nucleotides surrounding the edited/non-edited sites and on the estimated folding energies of those regions. Tree-based statistical methods using primary sequence data surrounding edited/non-edited sites and estimates of free energy of folding yield models with optimistic re-substitution-based estimates of approximately 0.71 accuracy, approximately 0.64 sensitivity, and approximately 0.88 specificity. Random forest analysis yielded better models and more exact performance estimates, with approximately 0.74 accuracy, approximately 0.72 sensitivity, and approximately 0.81 specificity for the combined observations.
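The three performance measures quoted above can be computed from a confusion matrix of edited versus non-edited sites. The following is a minimal sketch in Python, not the authors' code: the function name `classification_metrics`, the label encoding (1 = edited, 0 = non-edited), and the toy labels are all assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumed encoding: 1 = edited site, 0 = non-edited site.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # edited, predicted edited
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-edited, predicted non-edited
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-edited, predicted edited
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # edited, predicted non-edited

    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of truly edited sites that were predicted as edited.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of truly non-edited sites predicted as non-edited.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Toy labels, invented purely for illustration (not data from the study).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

Because the data sets pair each set of edited sites with an equally sized null set, accuracy here is simply the mean of sensitivity and specificity weighted by equal class sizes. A re-substitution estimate would apply such a function to the same observations used to fit the model, which is why the abstract describes those figures as optimistic.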
Simple models do moderately well in predicting which cytidines will be edited to uridines, and provide the first quantitative predictive models for RNA edited sites in plant mitochondria. Our analysis shows that the identity of the nucleotide -1 to the edited C and the estimated free energy of folding for a 41 nt region surrounding the edited C are the most important variables that distinguish most edited from non-edited sites. However, the results suggest that primary sequence data and simple free energy of folding calculations alone are insufficient to make highly accurate predictions.
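To make the two sequence-derived quantities concrete, the sketch below extracts the 41 nt window (20 nt 5', the candidate C, 20 nt 3') and reads off the nucleotide at position -1. It is an illustration only: the helper names `window_around` and `upstream_nucleotide`, the 0-based indexing, and the toy sequence are assumptions, and the estimated free energy of folding, which would require a separate RNA folding program, is not computed here.

```python
FLANK = 20  # nucleotides taken on each side of the candidate C


def window_around(seq, pos, flank=FLANK):
    """Return the (2 * flank + 1)-nt window centred on 0-based index `pos`.

    Returns None when the window would run past either end of `seq`.
    """
    if seq[pos] != "C":
        raise ValueError("position %d is not a cytidine" % pos)
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def upstream_nucleotide(window, flank=FLANK):
    """Return the nucleotide at position -1 relative to the central C."""
    return window[flank - 1]


# Toy sequence, invented for illustration: 19 A, one T, the central C, 20 G.
toy_seq = "A" * 19 + "T" + "C" + "G" * 20
toy_window = window_around(toy_seq, 20)
toy_minus_one = upstream_nucleotide(toy_window)  # "T" for this toy sequence
```

Each position of such a window could then serve as one categorical predictor for a tree-based model, with the folding energy estimate of the same region added as a numeric predictor.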