Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA.
TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX, USA.
Bioinformatics. 2019 Apr 1;35(7):1133-1141. doi: 10.1093/bioinformatics/bty765.
Non-coding RNAs (ncRNAs) are known to play crucial roles in various biological processes, and there is a pressing need for accurate computational methods that can efficiently scan genomes to detect novel ncRNAs. However, unlike coding genes, ncRNAs often lack distinctive sequence features that could be used to recognize them. Although many ncRNAs are known to have a well-conserved secondary structure, which provides useful cues for computational prediction, it has also been shown that a structure-based approach alone may not be sufficient for detecting ncRNAs in a single sequence. Currently, the most effective ncRNA detection methods combine structure-based techniques with comparative genome analysis to improve prediction performance.
In this paper, we propose RNAdetect, a computational method that incorporates novel features for accurate detection of ncRNAs in combination with comparative genome analysis. Given a sequence alignment, RNAdetect can accurately detect the presence of functional ncRNAs by incorporating novel predictive features based on the concept of generalized ensemble defect (GED), which assesses the degree of structure conservation across multiple related sequences and the conformity of the individual folding structures to a common consensus structure. Furthermore, n-gram models (NGMs) are used to extract features that can effectively capture sequence homology to known ncRNA families. The use of NGMs can enhance the detection of ncRNAs that have sparse folding structures with many unpaired bases. Extensive performance evaluation based on the Rfam database and bacterial genomes demonstrates that RNAdetect can accurately and reliably detect novel ncRNAs, outperforming the current state-of-the-art methods.
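To illustrate the general idea of n-gram features for a nucleotide sequence, the following is a minimal sketch that computes normalized n-gram frequencies. It is not RNAdetect's actual feature definition, which is not specified in this abstract: the function name `ngram_features`, the choice of `n=2`, the `ACGU` alphabet, and the handling of gaps and ambiguous symbols are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; DNA "T" is mapped to "U" below


def ngram_features(seq, n=2):
    """Return normalized frequencies of all length-n words over ALPHABET.

    Hypothetical helper for illustration only. Windows that contain gap
    characters or ambiguous symbols are ignored (an assumption).
    """
    seq = seq.upper().replace("T", "U")
    # Enumerate every possible n-gram so the feature vector has a fixed size.
    grams = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # Count every overlapping window of length n in the sequence.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Normalize only over windows made entirely of alphabet symbols.
    total = sum(counts[g] for g in grams)
    return {g: (counts[g] / total if total else 0.0) for g in grams}


if __name__ == "__main__":
    features = ngram_features("GGGAAACCC", n=2)
    print(len(features), "features")
```

A fixed-length vector of this kind can be passed to a standard classifier. In practice, an n-gram model trained on known ncRNA families would more likely be used to score a candidate sequence than raw frequencies alone.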
The source code for RNAdetect and the benchmark data used in this paper can be downloaded at https://github.com/bjyoontamu/RNAdetect.