De novo search for non-coding RNA genes in the AT-rich genome of Dictyostelium discoideum: performance of Markov-dependent genome feature scoring.

Authors

Larsson Pontus, Hinas Andrea, Ardell David H, Kirsebom Leif A, Virtanen Anders, Söderbom Fredrik

Affiliation

Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, SE-75124 Uppsala, Sweden.

Publication

Genome Res. 2008 Jun;18(6):888-99. doi: 10.1101/gr.069104.107. Epub 2008 Mar 17.

DOI:10.1101/gr.069104.107

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2413156/

Abstract

Genome data are increasingly important in the computational identification of novel regulatory non-coding RNAs (ncRNAs). However, most ncRNA gene-finders are either specialized to well-characterized ncRNA gene families or require comparisons of closely related genomes. We developed a method for de novo screening for ncRNA genes with a nucleotide composition that stands out against the background genome based on a partial sum process. We compared the performance when assuming independent and first-order Markov-dependent nucleotides, respectively, and used Karlin-Altschul and Karlin-Dembo statistics to evaluate the significance of hits. We hypothesized that a first-order Markov-dependent process might have better power to detect ncRNA genes since nearest-neighbor models have been shown to be successful in predicting RNA structures. A model based on a first-order partial sum process (analyzing overlapping dinucleotides) had better sensitivity and specificity than a zeroth-order model when applied to the AT-rich genome of the amoeba Dictyostelium discoideum. In this genome, we detected 94% of previously known ncRNA genes (at this sensitivity, the false positive rate was estimated to be 25% in a simulated background). The predictions were further refined by clustering candidate genes according to sequence similarity and/or searching for an ncRNA-associated upstream element. We experimentally verified six out of 10 tested ncRNA gene predictions. We conclude that higher-order models, in combination with other information, are useful for identification of novel ncRNA gene families in single-genome analysis of D. discoideum. Our generalizable approach extends the range of genomic data that can be searched for novel ncRNA genes using well-grounded statistical methods.
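The idea of a first-order partial sum process over overlapping dinucleotides can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, for illustration only, that each dinucleotide is scored by the log-likelihood ratio of first-order Markov transition probabilities estimated from a target set and a background set, and that a running sum reset at zero marks candidate segments. The function names, the pseudocount, the toy sequences, and the score threshold are all hypothetical, and the Karlin-Altschul/Karlin-Dembo significance evaluation mentioned in the abstract is not included.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def transition_probs(seqs, pseudocount=1.0):
    """Estimate P(b | a) from overlapping dinucleotides (first-order Markov)."""
    counts = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            if a in ALPHABET and b in ALPHABET:
                counts[a + b] += 1
    probs = {}
    for a in ALPHABET:
        row_total = sum(counts[a + b] + pseudocount for b in ALPHABET)
        for b in ALPHABET:
            probs[a + b] = (counts[a + b] + pseudocount) / row_total
    return probs

def llr_scores(target_seqs, background_seqs):
    """Log-likelihood ratio score per dinucleotide (assumed scoring scheme)."""
    target = transition_probs(target_seqs)
    background = transition_probs(background_seqs)
    return {k: math.log(target[k] / background[k]) for k in target}

def high_scoring_segments(seq, scores, threshold):
    """Scan the partial sum process; return (start, end, score), end exclusive."""
    hits = []
    running = best = 0.0
    start = best_end = 0
    for i in range(len(seq) - 1):
        # Unknown symbols (e.g. N) contribute a score of zero.
        running += scores.get(seq[i:i + 2], 0.0)
        if running <= 0.0:
            # The excursion has ended: report it if its peak was high enough.
            if best >= threshold:
                hits.append((start, best_end, best))
            running = best = 0.0
            start = i + 1
        elif running > best:
            best = running
            best_end = i + 2
    if best >= threshold:
        hits.append((start, best_end, best))
    return hits

# Toy data (hypothetical): GC-rich "target" examples and an AT-rich background.
target_examples = ["GCGCGGCCGCGC"]
background_examples = ["ATATTTAAATATTAATAT"]
scores = llr_scores(target_examples, background_examples)

toy_genome = "AT" * 10 + "GC" * 10 + "AT" * 10
hits = high_scoring_segments(toy_genome, scores, threshold=5.0)
print(hits)
```

In this toy example the scan reports a single segment covering the GC-rich insert. In a real analysis the threshold would come from a significance calculation rather than a fixed value, and a zeroth-order variant would score single nucleotides instead of dinucleotides.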
