Toyoda Tetsuro, Shinozaki Kazuo
Phenome Informatics Team, Functional Genomics Research Group, Genomic Sciences Center, Japan.
Plant J. 2005 Aug;43(4):611-21. doi: 10.1111/j.1365-313X.2005.02470.x.
Tiling arrays of high-density oligonucleotide probes spanning the entire genome are powerful tools for discovering new genes. However, it is difficult to determine the structure of the spliced product of a structurally unknown gene from noisy array signals alone. Here we introduce a statistical method that estimates the precise splice points and the exon/intron structure of a structurally unknown gene by maximizing the odds, that is, the ratio of posterior probabilities, of the structure given the observed array signal intensities and nucleic acid sequences. Our method predicted gene structures more accurately than a simple threshold-based method and estimated the expression values of structurally unknown genes more correctly than a window-based method. The Markov model contributed to the precision of the splice points, and the statistical significance of expression (P-value) was a good indicator of the reliability of the estimated gene structure and expression value. We have implemented the method as a program, ARTADE (ARabidopsis Tiling Array-based Detection of Exons), and applied it to the analysis of Arabidopsis thaliana whole-genome array data. The database of predicted results and the ARTADE program are available at http://omicspace.riken.jp/ARTADE/.
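The contrast between threshold-based calling and a probabilistic segmentation can be illustrated with a small sketch. This is not the ARTADE algorithm: ARTADE also uses nucleotide sequence information and a posterior odds criterion, whereas the code below uses only probe intensities, assumes Gaussian signal distributions for exonic and intronic probes, and finds the most probable labelling with a two-state Viterbi recursion. The function names, the means and standard deviations, the stay probability, and the cutoff are all hypothetical values chosen for illustration.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def segment_probes(signals, exon=(8.0, 1.0), intron=(5.0, 1.0), stay_prob=0.9):
    """Label each probe 'E' (exon) or 'I' (intron) with a two-state Viterbi pass.

    exon / intron are assumed (mean, sd) pairs for log-intensity;
    stay_prob is the assumed probability that adjacent probes share a state.
    """
    if not signals:
        return ""
    params = {"E": exon, "I": intron}
    states = ("E", "I")
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    # score[s]: best log-probability of any labelling that ends in state s
    score = {s: math.log(0.5) + gaussian_logpdf(signals[0], *params[s]) for s in states}
    backpointers = []
    for x in signals[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # choose the predecessor state that maximizes the path score
            best_prev = max(
                states,
                key=lambda p: score[p] + (log_stay if p == s else log_switch),
            )
            trans = log_stay if best_prev == s else log_switch
            new_score[s] = score[best_prev] + trans + gaussian_logpdf(x, *params[s])
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score

    # trace back from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


def threshold_labels(signals, cutoff=6.5):
    """Baseline: call a probe exonic whenever its signal exceeds a fixed cutoff."""
    return "".join("E" if x > cutoff else "I" for x in signals)


# Made-up log-intensities with one noisy probe (6.2) inside an expressed region
signals = [5.0, 5.2, 4.8, 8.1, 7.9, 6.2, 8.3, 8.0, 5.1, 4.9, 5.0]
print(threshold_labels(signals))  # IIIEEIEEIII
print(segment_probes(signals))    # IIIEEEEEIII
```

On this made-up input the fixed cutoff splits the expressed region at the noisy probe, while the segmentation keeps it as one block because switching state twice costs more than one weak signal gains. Real data would need parameters estimated from the arrays rather than the assumed values used here.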