Bhattacharyya Malay, Feuerbach Lars, Bhadra Tapas, Lengauer Thomas, Bandyopadhyay Sanghamitra
Indian Statistical Institute, Kolkata.
Stat Appl Genet Mol Biol. 2012 Jan 6;11(1):Article 6. doi: 10.2202/1544-6115.1743.
MicroRNAs (miRNAs) are short (21-23 nt), non-coding regulators of protein-coding genes. They are generally transcribed first into primary miRNA (pri-miR), followed by the generation of precursor miRNA (pre-miR) and, finally, the mature miRNA. A large amount of information is available on pre- and mature miRNAs. However, very little is known about pri-miRs, owing to a lack of knowledge about their transcription start sites (TSSs). Based on their genomic loci, miRNAs can be categorized into two types: intragenic (intra-miR) and intergenic (inter-miR). While it is established that intra-miRs are commonly transcribed in conjunction with their host genes, the transcription machinery of inter-miRs is poorly understood. Although miRNA promoters are assumed to be similar in structure to gene promoters, since both are transcribed by RNA polymerase II (Pol II), computational validations show that gene promoter prediction methods perform poorly on miRNAs. In this paper, we concentrate on the problem of TSS prediction for miRNAs. The present study begins with the identification of positive and negative promoter samples from recently published data stemming from RNA-sequencing studies. From these samples of experimentally validated miRNA TSSs, a number of standard sequence features are extracted. Furthermore, to account for potential footprints of promoter regulation by CpG dinucleotide-targeted DNA methylation, a number of novel features are defined. We develop a support vector machine (SVM) with an RBF kernel, trained on human miRNA promoters, for the prediction of miRNA TSSs. A novel feature reduction technique based on archived multi-objective simulated annealing (AMOSA) identifies the final set of features. The resulting model trained on miRNA promoters shows improved performance over the one trained on protein-coding gene promoters in terms of classification accuracy, sensitivity and specificity. Results are also reported for a completely independent, biologically validated test set. In one part of the investigation, the proposed approach is used to predict protein-coding gene TSSs. It shows significantly improved performance compared with previously published gene TSS prediction methods.
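The abstract does not list the features themselves, so the following is only an illustrative sketch of the kind of sequence features and kernel such a pipeline might use, not the authors' implementation. It assumes three commonly used promoter descriptors (GC content, the CpG observed/expected ratio, and dinucleotide frequencies) and the standard RBF kernel k(x, y) = exp(-gamma * ||x - y||^2). All function names and the default gamma are hypothetical. SVM training and the AMOSA feature reduction are not shown.

```python
import math
from itertools import product


def cpg_features(seq):
    """Return [GC content, CpG observed/expected ratio] for a DNA sequence.

    Assumption: these two descriptors stand in for the CpG-related
    features mentioned in the abstract, which does not define them.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Observed/expected CpG ratio: (#CpG * N) / (#C * #G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return [gc_content, obs_exp]


def kmer_frequencies(seq, k=2):
    """Return relative frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = max(n_windows, 1)
    return [counts[km] / total for km in kmers]


def featurize(seq):
    """Concatenate the CpG descriptors and dinucleotide frequencies."""
    return cpg_features(seq) + kmer_frequencies(seq, k=2)


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    # Made-up toy sequences, not real promoter data.
    cpg_rich = featurize("CGCGGCGCCGCGTACG")
    cpg_poor = featurize("ATTATAAATTTAGCAT")
    print(rbf_kernel(cpg_rich, cpg_poor, gamma=1.0))
```

In practice the feature vectors would be passed to an SVM library rather than to a hand-written kernel. The kernel is spelled out here only to show how similarity between two candidate promoter regions would be measured.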