IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2795-2801. doi: 10.1109/TCBB.2021.3057128. Epub 2021 Dec 8.
Non-coding RNA (ncRNA) is involved in many biological processes and diseases across all species. Many ncRNA datasets provide their data in FASTA format, which is well suited for biomedical purposes. For ncRNA analysis and classification, however, statistical learning methods require numerical features that are hidden in the sequence data and must be extracted from it. In addition, a wealth of sequence-intrinsic features has been proposed in the literature for ncRNA identification. Extracting these hidden features, analysing them, and choosing a suitable feature set are crucial for the performance of any statistical learning method. To address these challenges, we generated 96 feature datasets from widely used ncRNA features. The feature datasets are based on RNACentral and consist of species, ncRNA types, and expert databases that are available on the FexRNA platform. The feature datasets are also explored and analysed to provide statistical information together with univariate and bivariate analyses. We sought to determine which of the 17 features would be most appropriate for developing ncRNA classification approaches. For feature selection (FS), a two-phase hierarchical FS framework based on correlation and majority voting is proposed and evaluated on 5 species. The FexRNA platform provides information about ncRNA feature analysis and selection.
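The abstract does not spell out the two phases of the FS framework, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that phase 1 applies a correlation filter within each species dataset and that phase 2 keeps the features selected in a majority of species. The feature names, the toy values, the 0.9 threshold, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlation_filter(table, threshold=0.9):
    """Phase 1 (assumed): greedily keep a feature unless it is strongly
    correlated with a feature that was already kept."""
    kept = []
    for name, column in table.items():
        if all(abs(pearson(column, table[k])) < threshold for k in kept):
            kept.append(name)
    return set(kept)


def majority_vote(selections):
    """Phase 2 (assumed): keep features chosen in more than half the datasets."""
    votes = {}
    for chosen in selections:
        for name in chosen:
            votes[name] = votes.get(name, 0) + 1
    return {name for name, count in votes.items() if count > len(selections) / 2}


def two_phase_selection(datasets, threshold=0.9):
    """Run the per-species filter, then vote across species."""
    selections = [correlation_filter(t, threshold) for t in datasets.values()]
    return majority_vote(selections)


# Hypothetical toy data: one small feature table per species.
toy = {
    "species_a": {
        "length": [10, 20, 30, 40],
        "length_x2": [20, 40, 60, 80],      # redundant copy of length
        "gc_content": [0.5, 0.3, 0.6, 0.4],
    },
    "species_b": {
        "length": [5, 15, 25, 35],
        "length_x2": [10, 30, 50, 70],
        "gc_content": [0.6, 0.4, 0.7, 0.5],
    },
    "species_c": {
        "length": [1, 2, 3, 4],
        "length_x2": [2, 4, 6, 8],
        "gc_content": [0.1, 0.2, 0.3, 0.4],  # here GC tracks length exactly
    },
}

selected = two_phase_selection(toy)
print(sorted(selected))  # ['gc_content', 'length']
```

In this toy run the redundant feature is dropped in every species, while the GC feature is dropped only in the third species and therefore survives the vote. The greedy filter depends on feature order, which is a simplification of this sketch.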