Larsson Pontus, Hinas Andrea, Ardell David H, Kirsebom Leif A, Virtanen Anders, Söderbom Fredrik
Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, SE-75124 Uppsala, Sweden.
Genome Res. 2008 Jun;18(6):888-99. doi: 10.1101/gr.069104.107. Epub 2008 Mar 17.
Genome data are increasingly important in the computational identification of novel regulatory non-coding RNAs (ncRNAs). However, most ncRNA gene-finders are either specialized to well-characterized ncRNA gene families or require comparisons of closely related genomes. We developed a method for de novo screening for ncRNA genes with a nucleotide composition that stands out against the background genome based on a partial sum process. We compared the performance when assuming independent and first-order Markov-dependent nucleotides, respectively, and used Karlin-Altschul and Karlin-Dembo statistics to evaluate the significance of hits. We hypothesized that a first-order Markov-dependent process might have better power to detect ncRNA genes since nearest-neighbor models have been shown to be successful in predicting RNA structures. A model based on a first-order partial sum process (analyzing overlapping dinucleotides) had better sensitivity and specificity than a zeroth-order model when applied to the AT-rich genome of the amoeba Dictyostelium discoideum. In this genome, we detected 94% of previously known ncRNA genes (at this sensitivity, the false positive rate was estimated to be 25% in a simulated background). The predictions were further refined by clustering candidate genes according to sequence similarity and/or searching for an ncRNA-associated upstream element. We experimentally verified six out of 10 tested ncRNA gene predictions. We conclude that higher-order models, in combination with other information, are useful for identification of novel ncRNA gene families in single-genome analysis of D. discoideum. Our generalizable approach extends the range of genomic data that can be searched for novel ncRNA genes using well-grounded statistical methods.
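The partial sum screening described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each overlapping dinucleotide is scored by a log-odds ratio of first-order transition probabilities (a target model versus the background genome) and that the running sum resets to zero whenever it becomes non-positive. The function names, the pseudocount, and the toy sequences are hypothetical, and the Karlin-Altschul/Karlin-Dembo significance calculation is omitted.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def transition_probs(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(b | a) from sequences.

    The pseudocount is an assumption of this sketch; it keeps every
    probability non-zero so the log-odds scores are always defined.
    """
    counts = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):  # overlapping dinucleotides
            counts[a + b] += 1
    probs = {}
    for a in ALPHABET:
        row_total = sum(counts[a + b] + pseudocount for b in ALPHABET)
        for b in ALPHABET:
            probs[a + b] = (counts[a + b] + pseudocount) / row_total
    return probs

def dinucleotide_scores(target_seqs, background_seq, pseudocount=1.0):
    """Log-odds score per dinucleotide: log P_target(b|a) / P_background(b|a)."""
    tgt = transition_probs(target_seqs, pseudocount)
    bg = transition_probs([background_seq], pseudocount)
    return {d: math.log(tgt[d] / bg[d]) for d in tgt}

def max_scoring_segment(seq, scores):
    """Scan the partial sum of dinucleotide scores, resetting at zero.

    Returns (best_score, start, end) so that seq[start:end] is the
    highest-scoring segment found; (0.0, 0, 0) means nothing scored above zero.
    """
    best, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i in range(len(seq) - 1):
        running += scores.get(seq[i:i + 2], 0.0)  # unknown symbols score 0
        if running <= 0:
            running, start = 0.0, i + 1  # restart the excursion
        elif running > best:
            best, best_start, best_end = running, start, i + 2
    return best, best_start, best_end

if __name__ == "__main__":
    # Toy data: a GC-rich insert in an AT-rich background (purely illustrative).
    background = "AT" * 50
    genome = "AT" * 20 + "GC" * 10 + "AT" * 20
    scores = dinucleotide_scores(["GCGCGCGCGC"], background)
    score, start, end = max_scoring_segment(genome, scores)
    print(round(score, 2), start, end, genome[start:end])
```

In a real screen the segment scores would then be compared against their expected distribution in the background genome to assign significance, which is the role the abstract gives to Karlin-Altschul and Karlin-Dembo statistics.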