National Agri-Food Biotechnology Institute, Sector 81, SAS Nagar, 140306, Punjab, India.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab178.
Detection of novel transcripts with deep sequencing has increased the demand for computational algorithms, as identifying and validating these transcripts with in vivo techniques is time-consuming, costly and unreliable. Most of the discovered transcripts are non-coding RNAs (ncRNAs), a large group known for diverse functional roles but lacking a common taxonomy. Thus, once a transcript has been found to lack coding potential, it is crucial to recognize its primary functional category. To address this heterogeneity, we divide the ncRNAs into three classes and present RNA classifier (RNAC), which categorizes RNAs into coding, housekeeping, small non-coding and long non-coding classes. RNAC uses alignment-based genomic descriptors to extract statistical, local binary pattern and histogram features, and fuses them to construct classification models with extreme gradient boosting. The experiments are performed on four species, and performance is assessed on both the multiclass problem and the conventional binary classification problem (coding versus non-coding). The proposed approach achieved >93% accuracy on both classification problems and also outperformed other well-known existing methods in coding potential prediction. This validates the usefulness of feature fusion for improving performance on both types of classification problem. Hence, RNAC is a valuable tool for the accurate identification of multiple RNA classes.
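The feature-fusion idea described above can be illustrated with a minimal sketch. It is not the RNAC implementation: it assumes each transcript has already been reduced to a non-empty numeric descriptor signal scaled to [0, 1] (a hypothetical stand-in for the alignment-based genomic descriptors), and the function names, the radius-1 one-dimensional local binary pattern and the four-bin histogram are illustrative choices. The extreme gradient boosting classifier that would be trained on the fused vectors is omitted.

```python
from statistics import mean, pstdev


def statistical_features(signal):
    """Summary statistics of a numeric descriptor signal (hypothetical input)."""
    return [mean(signal), pstdev(signal), min(signal), max(signal)]


def lbp_1d_histogram(signal, radius=1):
    """Normalised histogram of 1D local binary pattern codes.

    Each interior point is compared with `radius` neighbours on either side;
    a neighbour that is >= the centre value sets one bit of the code.
    """
    signal = list(signal)
    counts = [0] * (2 ** (2 * radius))
    for i in range(radius, len(signal) - radius):
        centre = signal[i]
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for bit, value in enumerate(neighbours):
            if value >= centre:
                code |= 1 << bit
        counts[code] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def value_histogram(signal, bins=4, lo=0.0, hi=1.0):
    """Normalised histogram of signal values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in signal:
        index = int((value - lo) / width)
        index = min(max(index, 0), bins - 1)  # clamp values on or outside the edges
        counts[index] += 1
    return [c / len(signal) for c in counts]


def fuse_features(signal):
    """Concatenate the three feature groups into one vector (feature fusion)."""
    return (
        statistical_features(signal)
        + lbp_1d_histogram(signal)
        + value_histogram(signal)
    )


if __name__ == "__main__":
    example_signal = [0.1, 0.5, 0.3, 0.9, 0.2]  # made-up values for illustration
    print(fuse_features(example_signal))
```

With these defaults every transcript yields a fixed-length vector of 12 values (4 statistical, 4 local binary pattern, 4 histogram), which is the shape a gradient-boosted classifier would expect for either the four-class or the binary problem.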