Department of Computer Science, University of Brasilia, ICC Central, Instituto de Ciências Exatas, Campus Universitario Darcy Ribeiro, Asa Norte, CEP: 70910-900, Brasilia, Brazil.
Gerência Regional de Brasilia (GEREB), Oswaldo Cruz Foundation (Fiocruz), Av. L3 Norte, Campus Universitário Darcy Ribeiro, Gleba A, Asa Norte, CEP: 70910-900, Brasília, Brazil.
BMC Genomics. 2017 Oct 18;18(1):804. doi: 10.1186/s12864-017-4178-4.
In recent years, thousands of sequencing projects around the world have generated a rapidly increasing number of RNA transcripts, creating enormous volumes of transcript data to be analyzed. An important problem in analyzing these data is distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs). We therefore present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on the frequencies of nucleotide patterns and on ORF lengths in transcripts.
The proposed method is based on SVM and uses as features the relative length of the first ORF and the frequencies of nucleotide patterns selected by PCA. FASTA files were used as input to calculate all possible features. These features were divided into two sets: (i) 336 frequencies of nucleotide patterns; and (ii) 4 features derived from ORFs. PCA was applied to the first set to identify 6 groups of frequencies that could contribute most to the distinction. Twenty-four experiments, combining the 6 groups from the first set with the features from the second set, were built to find the best model to distinguish lncRNAs from PCTs.
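The two feature sets described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on assumptions that the abstract does not state: that the 336 pattern frequencies are all di-, tri- and tetranucleotide frequencies (16 + 64 + 256 = 336), and that the "first ORF" is the one starting at the first ATG and ending at the first in-frame stop codon. The function names `kmer_frequencies` and `first_orf_relative_length` are hypothetical.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, ks=(2, 3, 4)):
    """Relative frequency of every k-mer over ACGT, for each k in ks.

    Assumption: k = 2, 3, 4 gives 16 + 64 + 256 = 336 features, matching
    the count in the abstract; the paper's exact patterns may differ.
    """
    seq = seq.upper().replace("U", "T")
    freqs = {}
    for k in ks:
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with N or other ambiguity codes
                counts[kmer] += 1
                total += 1
        for kmer, count in counts.items():
            freqs[kmer] = count / total if total else 0.0
    return freqs


def first_orf_relative_length(seq):
    """Length of the first ORF divided by the transcript length.

    Assumption: the first ORF starts at the first ATG and ends at the first
    in-frame stop codon (stop included); 0.0 is returned if none is found.
    """
    seq = seq.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return 0.0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (i + 3 - start) / len(seq)
    return 0.0
```

Under these assumptions, each transcript read from a FASTA file would yield one feature vector (the 336 frequencies plus the ORF-derived values), which could then be passed to PCA-based selection and an SVM classifier.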
This method was trained and tested with human (Homo sapiens), mouse (Mus musculus) and zebrafish (Danio rerio) data, achieving 98.21%, 98.03% and 96.09% accuracy, respectively. Compared with other tools available in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc), our method improved accuracy by ≈3.00%. In addition, to validate our model, the mouse data were classified with the human model, and vice versa, achieving ≈97.80% accuracy in both cases, showing that the model is not overfit. The SVM models were also validated with data from rat (Rattus norvegicus), pig (Sus scrofa) and fruit fly (Drosophila melanogaster), obtaining more than 84.00% accuracy in all of these organisms. Our results also showed that 81.2% of human pseudogenes and 91.7% of mouse pseudogenes were classified as non-coding. Moreover, our method was able to re-annotate two uncharacterized sequences of the Swiss-Prot database as having a high probability of being lncRNAs. Finally, to assess the use of the method for annotating transcripts derived from RNA-seq, previously identified lncRNAs of human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta) were analyzed, of which 98.62%, 80.8% and 91.9%, respectively, were successfully classified.
The SVM method proposed in this work performs well in distinguishing lncRNAs from PCTs, as shown in the results. To build the model, besides using ORF features known in the literature, we used PCA to identify, in reference data sets, the nucleotide pattern frequencies that contribute most to distinguishing lncRNAs from PCTs. Interestingly, models created with two evolutionarily distant species could distinguish lncRNAs of even more distant species.