Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY 11724, USA.
BMC Bioinformatics. 2010 Nov 8;11:550. doi: 10.1186/1471-2105-11-550.
The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) because coding exons consist of a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in that of non-coding segments. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size limits the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
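The windowed Fourier approach described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `period3_power` and `sliding_period3`, the four-base indicator encoding, the normalization by window length, and the default window and step sizes are all assumptions chosen for illustration.

```python
import numpy as np

BASES = "ACGT"

def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over the four base
    indicator sequences and divided by the segment length.
    (Encoding and normalization are illustrative assumptions.)"""
    n = len(seq)
    if n == 0:
        return 0.0
    arr = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    # Complex exponential that completes one cycle every 3 bases.
    phase = np.exp(-2j * np.pi * np.arange(n) / 3)
    total = 0.0
    for base in BASES:
        indicator = (arr == ord(base)).astype(float)
        total += abs(np.dot(indicator, phase)) ** 2
    return total / n

def sliding_period3(seq: str, window: int = 351, step: int = 3):
    """Evaluate period3_power on overlapping windows.
    Returns the window start positions and the power in each window."""
    starts = np.arange(0, len(seq) - window + 1, step)
    power = np.array([period3_power(seq[s:s + window]) for s in starts])
    return starts, power

if __name__ == "__main__":
    # Toy input: random flanks around a perfectly periodic stretch.
    rng = np.random.default_rng(0)
    flank = "".join(rng.choice(list(BASES), size=600))
    toy = flank + "ATG" * 200 + flank[::-1]
    starts, power = sliding_period3(toy)
    print("window with the strongest period-3 signal starts at",
          int(starts[np.argmax(power)]))
```

Because every value in the profile summarizes a whole window, the transition between low and high power is spread over roughly one window length, which is the boundary-localization limitation noted above.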
Using both simulated TP signals and the real C. elegans sequence F56F11 as examples, we demonstrate that: (1) the Modified Wavelet Transform (MWT) defines the boundaries of TP regions better than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: larger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false-positive signals, which can be removed with a 6 bp MWT.
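The trade-off between boundary sharpness and signal-to-noise ratio in point (2) can be explored with a generic Gaussian-windowed period-3 filter, sketched below. This is not the MWT itself, whose definition is not given here. The function name `gaussian_period3_profile` and the width parameter `sigma` are hypothetical, and how `sigma` would map onto the scale parameter a is left open.

```python
import numpy as np

def gaussian_period3_profile(seq: str, sigma: float = 30.0) -> np.ndarray:
    """Per-position period-3 strength from a Gaussian-windowed complex
    exponential (a Morlet-style kernel with its period fixed at 3 bases).
    `sigma` (in bases) sets the effective window width; it is an
    illustrative parameter, not the scale parameter of the MWT."""
    arr = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    half = int(np.ceil(4 * sigma))
    t = np.arange(-half, half + 1)
    if len(arr) < len(t):
        raise ValueError("sequence is shorter than the kernel")
    kernel = np.exp(-0.5 * (t / sigma) ** 2) * np.exp(-2j * np.pi * t / 3)
    kernel /= np.sqrt(np.sum(np.abs(kernel) ** 2))  # unit-energy kernel
    profile = np.zeros(len(arr))
    for base in b"ACGT":  # iterating over bytes yields integer codes
        indicator = (arr == base).astype(float)
        response = np.convolve(indicator, kernel, mode="same")
        profile += np.abs(response) ** 2
    return profile

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    flank = "".join(rng.choice(list("ACGT"), size=600))
    toy = flank + "ATG" * 200 + flank[::-1]
    for sigma in (10.0, 30.0, 90.0):
        p = gaussian_period3_profile(toy, sigma)
        print(f"sigma={sigma}: inside={p[900]:.1f}, outside median={np.median(p[:400]):.2f}")
```

In this sketch a smaller `sigma` makes the profile rise over fewer bases at the edges of the periodic stretch, while the contrast between the periodic region and the random flanks shrinks. This matches the qualitative trade-off between localization and signal-to-noise ratio reported above.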
MWT can provide more precise TP boundaries than STFT, and the boundaries can be further refined by using a larger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect splice junctions in fully spliced mRNA sequences.