College of Mathematics and Computer Science, Fuzhou University, China.
J Theor Biol. 2012 Mar 7;296:95-102. doi: 10.1016/j.jtbi.2011.12.002. Epub 2011 Dec 8.
The Self Adaptive Spectral Rotation (SASR) method was recently developed to identify protein-coding regions in new sequences from unknown organisms for which no training sets exist. It visualizes the Triplet Periodicity (TP) property, a simple and universal coding-related property, and thereby reveals the rough locations of coding regions visually, without any training. However, the method does not delimit those locations numerically. Building on the SASR method, we develop a new approach, named the T-Z-T analysis, that provides numerical coding-region predictions. The approach first applies a t-test segmentation to separate coding from non-coding regions in the SASR output, then uses a z-test filter to recognize region patterns. A second t-test segmentation then splits adjacent coding regions by detecting frame shifts. Because it operates on the graphic output of the SASR, the approach requires no training. It is also more stable, because it is not sensitive to errors in the input DNA sequence. These advantages make it suitable for early-stage coding-region prediction, when training sets are insufficient and the input data may even be inaccurate.
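The t-test segmentation step can be illustrated with a minimal sketch. The abstract does not give the procedure's details, so everything here is an assumption made for illustration: the input is a generic one-dimensional signal standing in for the SASR output, the statistic is Welch's t, the signal is split recursively at the point of maximum |t|, and the names `welch_t`, `best_split`, `segment`, `min_len`, and `t_threshold` are hypothetical. The published T-Z-T analysis may differ on each of these points.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both samples are constant: no evidence unless the means differ.
        return 0.0 if mean(a) == mean(b) else math.inf
    return (mean(a) - mean(b)) / se


def best_split(signal, min_len):
    """Return (index, |t|) of the split that best separates the two sides."""
    best_idx, best_t = None, 0.0
    for i in range(min_len, len(signal) - min_len + 1):
        t = abs(welch_t(signal[:i], signal[i:]))
        if t > best_t:
            best_idx, best_t = i, t
    return best_idx, best_t


def segment(signal, min_len=10, t_threshold=6.0, offset=0):
    """Recursively split `signal`; return the sorted boundary indices.

    `min_len` and `t_threshold` are illustrative values, not taken
    from the paper.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2 to estimate a variance")
    idx, t = best_split(signal, min_len)
    if idx is None or t < t_threshold:
        return []  # no sufficiently significant change point in this piece
    left = segment(signal[:idx], min_len, t_threshold, offset)
    right = segment(signal[idx:], min_len, t_threshold, offset + idx)
    return left + [offset + idx] + right


if __name__ == "__main__":
    # Synthetic stand-in for a SASR-like trace: low, high, low plateaus.
    random.seed(0)
    trace = (
        [random.gauss(0.0, 0.5) for _ in range(60)]
        + [random.gauss(5.0, 0.5) for _ in range(60)]
        + [random.gauss(0.0, 0.5) for _ in range(60)]
    )
    print(segment(trace))
```

Recursive splitting is used here only because it is short to write. The scan recomputes means and variances at every candidate split, so a practical version would use running sums, and a fixed threshold on |t| ignores the multiple comparisons made across split points.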