Ventola Giovanna M M, Noviello Teresa M R, D'Aniello Salvatore, Spagnuolo Antonietta, Ceccarelli Michele, Cerulo Luigi
Department of Science and Technology, University of Sannio, via Port'Arsa, 11, Benevento, 82100, Italy.
BioGeM, Institute of Genetic Research "Gaetano Salvatore", c.da Camporeale, Ariano Irpino (AV), 83031, Italy.
BMC Bioinformatics. 2017 Mar 23;18(1):187. doi: 10.1186/s12859-017-1594-z.
The discovery that long non-coding RNAs are important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs among transcripts assembled from high-throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish coding from non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used successfully in the literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs.
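As an illustration of two of the feature classes named above (open reading frame and nucleotide arrangements), the following minimal sketch computes the longest forward-strand ORF length and normalized k-mer frequencies for one transcript. The function names, the ATG-to-stop ORF definition, and the choice of k = 3 are assumptions made for this example; they are not taken from the paper and may differ from the features the authors actually extracted.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length (nt, stop codon included) of the longest ATG..stop ORF
    found in the three forward reading frames. Hypothetical helper."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of every A/C/G/T k-mer in the sequence.
    k-mers containing other symbols (e.g. N) are ignored."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


if __name__ == "__main__":
    transcript = "CCATGAAATAGGG"  # toy sequence, not real data
    print(longest_orf_length(transcript))
    print(kmer_frequencies(transcript, k=2)["AT"])
```

A feature vector for a classifier could then be built by concatenating the ORF length with the k-mer frequencies; conservation, secondary-structure, and repeat features would need external tools and are not sketched here.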
In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature and add, as a novel feature set, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test signature stability, and assess the predictive ability of each signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish: some features are shared among species, while others tend to be species-specific. Compared with coding potential tools and similar supervised approaches, including novel signatures such as those identified here in a machine learning algorithm improves prediction performance, measured as area under the precision-recall curve, by 1 to 24%, depending on the species and the signature.
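Two of the evaluation steps mentioned above can be made concrete with a short sketch. The abstract does not say which stability measure or which estimator of the area under the precision-recall curve was used, so the choices here (mean pairwise Jaccard index across signatures selected on resampled data, and step-wise average precision) are assumptions for illustration only, as are all function and feature names.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Jaccard index of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def signature_stability(signatures) -> float:
    """Mean pairwise Jaccard index over signatures obtained from
    repeated feature selection runs (e.g. on bootstrap samples)."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def average_precision(labels, scores) -> float:
    """Step-wise estimate of the area under the precision-recall curve.
    labels: 1 = lncRNA, 0 = other; scores: higher = more lncRNA-like.
    Tied scores keep their input order (no special tie handling)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    true_pos, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    # Toy signatures and predictions, not results from the study.
    runs = [{"orf_length", "gc"}, {"orf_length", "kmer"}, {"orf_length", "gc"}]
    print(signature_stability(runs))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A stability value near 1 would indicate that the selection algorithm returns nearly the same signature on every resample; comparing average precision between classifiers trained on different signatures mirrors the kind of comparison the abstract reports.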
Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines, which is especially relevant for poorly annotated genomes such as zebrafish. We provide a web tool that uses the obtained signatures to recognize novel long non-coding RNAs from input in FASTA and GTF formats. The tool is available at the following URL: http://www.bioinformatics-sannio.org/software/ .