Yoon Kihoon, Ko Daijin, Doderer Mark, Livi Carolina B, Penalva Luiz O F
Department of Epidemiology and Biostatistics, The University of Texas Health Science Center at San Antonio, San Antonio, Texas 78229-3900, USA.
RNA Biol. 2008 Oct-Dec;5(4):255-62. doi: 10.4161/rna.7116. Epub 2008 Oct 3.
Eukaryotic gene expression must be coordinated for biological processes to function properly. This coordination can be achieved at both the transcriptional and post-transcriptional levels. In both cases, regulatory sequences located in promoter regions or in UTRs act as markers that are recognized by regulators, which can then activate or repress different groups of genes as needed. While regulatory sequences involved in transcription are well documented, little information is available on sequence elements involved in post-transcriptional regulation. We used a statistical over-representation method to identify novel regulatory elements located in UTRs. An exhaustive search was used to calculate the frequency of all possible n-mers (short nucleotide sequences) in 16,160 human genes from NCBI RefSeq and to identify any unusual usage of n-mers in UTRs. After a stringent filtering process, we identified 2,772 highly over-represented n-mers in 3' UTRs. We provide evidence that these n-mers are potentially involved in regulatory functions. The identified n-mers overlap with previously identified binding sites for HuR and TIA-1, and with ARE and GRE sequences. We also determined that the n-mers overlap with predicted miRNA target sites. Finally, a method for clustering n-mer groups allowed the identification of putative gene networks.
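The exhaustive n-mer counting and over-representation idea described above can be illustrated with a minimal sketch. The abstract does not specify the statistic or the filtering criteria that were actually used, so everything below is an assumption made for illustration: the function names `count_nmers` and `enrichment` are hypothetical, the score is a simple pseudocount-smoothed frequency ratio against a background set, and the toy sequences are invented rather than taken from RefSeq.

```python
from collections import Counter


def count_nmers(sequences, n):
    """Count every overlapping n-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Sequences shorter than n contribute nothing (empty range).
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def enrichment(utr_seqs, background_seqs, n, pseudocount=1.0):
    """Ratio of each n-mer's smoothed frequency in UTRs to its
    smoothed frequency in a background set (illustrative score only)."""
    utr = count_nmers(utr_seqs, n)
    bg = count_nmers(background_seqs, n)
    utr_total = sum(utr.values())
    bg_total = sum(bg.values())
    vocab = 4 ** n  # assumes a four-letter alphabet (A, C, G, U)
    scores = {}
    for nmer, c in utr.items():
        f_utr = (c + pseudocount) / (utr_total + pseudocount * vocab)
        f_bg = (bg.get(nmer, 0) + pseudocount) / (bg_total + pseudocount * vocab)
        scores[nmer] = f_utr / f_bg
    return scores


# Toy sequences, invented for illustration (not RefSeq data).
utrs = ["AUUUAUUUAUUUA", "GGAUUUAUUUACC"]
background = ["GCGCGCGCGCGCG", "ACGUACGUACGUA"]

scores = enrichment(utrs, background, n=5)
# Rank n-mers from most to least enriched in the toy UTR set.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy example the ARE-like pentamer `AUUUA` is among the highest-scoring n-mers because it recurs in the UTR set and is absent from the background. A real analysis would need a proper significance test and multiple-testing correction, which this sketch does not attempt.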