Turi Antonio, Loglisci Corrado, Salvemini Eliana, Grillo Giorgio, Malerba Donato, D'Elia Domenica
Department of Computer Science, University of Bari, Via Orabona 4, 70125 Bari, Italy.
BMC Bioinformatics. 2009 Jun 16;10 Suppl 6(Suppl 6):S25. doi: 10.1186/1471-2105-10-S6-S25.
Many studies report the detection and functional characterization of cis-regulatory motifs in the untranslated regions (UTRs) of mRNAs, but little is known about the nature and functional role of their distribution. To address this issue, we have developed a computational approach based on data mining techniques. The idea is to mine frequent combinations of translation regulatory motifs, since their significant co-occurrences could reveal functional relationships important for the post-transcriptional control of gene expression. The experiments focused on nuclear transcripts targeting mitochondria, to elucidate the role of translational control in mitochondrial biogenesis and function.
The analysis is based on a two-step procedure using a sequential pattern mining algorithm. The first step searches for frequent patterns (FPs) of motifs without taking into account their spatial arrangement. In the second step, frequent sequential patterns (FSPs) of spaced motifs are generated by taking into account the conservation of spacers between each ordered pair of co-occurring motifs. The algorithm makes no assumption about the relations among motifs or about the number of motifs involved in a pattern. Different FSPs can be found depending on the combination of two parameters: the minimum support threshold, i.e. the minimum percentage of sequences that must support a pattern, and the granularity of spacer discretization. Results can be retrieved at the UTRminer web site: http://utrminer.ba.itb.cnr.it/. A total of 216 FPs of motifs were discovered in the overall dataset and 140 in the human subset. For each FP, the system provides information on the discovered FSPs, if any. A variety of search options help users browse the web resource. The list of sequence IDs supporting each pattern can be used to retrieve information from the UTRminer database.
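The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' algorithm or the UTRminer implementation: the function names, the toy motif hits, and the restriction of step 2 to motif pairs are all assumptions made for illustration. Step 1 counts motif sets regardless of position; step 2 keeps ordered pairs whose spacer, discretized into bins of width `bin_width`, is shared by enough sequences.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each UTR is a list of (motif, start, end) hits.
utrs = {
    "seq1": [("IRES", 10, 40), ("uORF", 60, 90)],
    "seq2": [("IRES", 5, 35), ("uORF", 58, 88), ("TOP", 0, 4)],
    "seq3": [("IRES", 20, 50), ("uORF", 120, 150)],
    "seq4": [("TOP", 0, 5)],
}


def frequent_patterns(utrs, min_support, max_size=3):
    """Step 1 (sketch): motif sets found in >= min_support of the sequences.

    Positions are ignored; a brute-force count stands in for a real miner.
    """
    counts = Counter()
    for hits in utrs.values():
        motifs = sorted({motif for motif, _, _ in hits})
        for k in range(2, max_size + 1):
            counts.update(combinations(motifs, k))
    n = len(utrs)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


def frequent_sequential_patterns(utrs, pairs, min_support, bin_width):
    """Step 2 (sketch): ordered motif pairs with a conserved spacer bin.

    The spacer is the gap between the end of the first motif and the start
    of the second, discretized by integer division with bin_width.
    """
    counts = Counter()
    for hits in utrs.values():
        seen = set()  # count each pattern at most once per sequence
        ordered = sorted(hits, key=lambda h: h[1])
        for (a, _, a_end), (b, b_start, _) in combinations(ordered, 2):
            if tuple(sorted((a, b))) in pairs and b_start >= a_end:
                seen.add((a, (b_start - a_end) // bin_width, b))
        counts.update(seen)
    n = len(utrs)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


fps = frequent_patterns(utrs, min_support=0.5)
fine = frequent_sequential_patterns(utrs, set(fps), 0.5, bin_width=25)
coarse = frequent_sequential_patterns(utrs, set(fps), 0.5, bin_width=100)
print(fps, fine, coarse)
```

On this toy input the pair (IRES, uORF) is supported by three of the four sequences. With a bin width of 25 only two of them share a spacer bin, while a bin width of 100 puts all three in the same bin, which shows how the two parameters jointly determine which FSPs are reported.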
Computational prediction of the structural properties of regulatory sequences is not trivial. The presented data mining approach overcomes some of the limitations observed in competing tools. Preliminary results on UTR sequences from nuclear transcripts targeting mitochondria are promising and give us confidence in the effectiveness of the approach for future developments.