School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab023.
With the development of next-generation sequencing technology, a large number of transcripts need to be analyzed, and distinguishing non-coding ribonucleic acids (ncRNAs) from coding RNAs remains a challenge. For non-model organisms, many existing methods cannot identify ncRNAs because transcriptional data are lacking. We therefore proposed a hybrid framework based on the stacking strategy to identify ncRNAs; in addition to deoxyribonucleic acid (DNA)-based and RNA-based features, it uses eight new features derived from predicted peptides. The framework is a two-layer stacking classifier that combines random forest (RF), LightGBM, XGBoost and logistic regression (LR) models. We used this framework to build two types of models: a cross-species model and a plant-specific model. We tested the cross-species ncRNA identification model on six species: human, mouse, zebrafish, fruit fly, worm and Arabidopsis. Compared with other tools, our model performed best on the Arabidopsis, worm and zebrafish datasets, with accuracies of 98.36%, 99.65% and 94.12%, respectively. When the datasets of the six species were pooled into a single set for performance metric analysis, our model achieved the best sensitivity, accuracy, precision and F1 values. For the plant-specific ncRNA identification model, the average values of the six metrics across the two experiments were all greater than 95%, which demonstrates that the model can be used to identify ncRNAs in plants. These results indicate that the hybrid framework applies to both animals and plants and has significant advantages in cross-species ncRNA identification.
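The pooled evaluation described in the abstract (treating the six species as one set and reporting sensitivity, accuracy, precision and F1) can be illustrated with a minimal sketch. It assumes binary labels in which 1 marks an ncRNA and 0 a coding RNA. The names `Confusion`, `confusion`, `metrics` and `pooled_metrics` are hypothetical helpers introduced here, and the toy labels are made up for illustration; none of this is taken from the authors' code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary task (assumption: ncRNA is the positive class)."""
    tp: int  # ncRNAs predicted as ncRNA
    fp: int  # coding RNAs predicted as ncRNA
    tn: int  # coding RNAs predicted as coding
    fn: int  # ncRNAs predicted as coding


def confusion(y_true, y_pred, positive=1):
    """Tally the confusion counts from paired true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(c):
    """Compute the four metrics named in the abstract from confusion counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    precision = _ratio(c.tp, c.tp + c.fp)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


def pooled_metrics(per_species):
    """Concatenate all species' labels into one set, then score that set.

    per_species maps a species name to a (y_true, y_pred) pair.
    """
    all_true, all_pred = [], []
    for y_true, y_pred in per_species.values():
        all_true.extend(y_true)
        all_pred.extend(y_pred)
    return metrics(confusion(all_true, all_pred))


if __name__ == "__main__":
    # Made-up toy labels for two species; not results from the paper.
    toy = {
        "species_a": ([1, 1, 1, 0], [1, 1, 0, 0]),
        "species_b": ([0, 0, 1, 0], [0, 1, 1, 0]),
    }
    for name, value in pooled_metrics(toy).items():
        print(f"{name}: {value:.4f}")
```

Pooling the labels before scoring weights each transcript equally, so larger datasets dominate the result; averaging per-species scores would instead weight each species equally. Which of the two the authors intended beyond "considered as a whole set" is not stated in the abstract.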