R5hmCFDV: computational identification of RNA 5-hydroxymethylcytosine based on deep feature fusion and deep voting.

Affiliation

School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China.

Publication information

Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac341.

Abstract

RNA 5-hydroxymethylcytosine (5hmC) is an RNA modification involved in the life activities of many organisms, and mapping its distribution is important for revealing its biological function. High-throughput sequencing has been used to identify 5hmC, but it is expensive and inefficient, so machine learning is used to identify 5hmC sites instead. Here, we design a model called R5hmCFDV, which consists of three stages: feature representation, feature fusion and classification. (i) Pseudo dinucleotide composition, dinucleotide binary profile and frequency, natural vector and physicochemical property are used to extract features from four aspects: nucleotide composition, coding, natural language, and physical and chemical properties. (ii) To strengthen the relevance of the features, we construct a novel feature fusion method. First, an attention mechanism processes each of the four single features; the results are concatenated and fed to a convolution layer. The convolution output is then processed by a BiGRU and a BiLSTM separately. Finally, the outputs of these two branches are fused by a multiply function. (iii) We design a deep voting algorithm for classification, modelled on the soft voting mechanism of an existing Python package. The base classifiers are a deep neural network (DNN), a convolutional neural network (CNN) and an improved gated recurrent unit (GRU). Following the principle of soft voting, a weight is assigned to each of the three classifiers; each classifier's predicted probability is multiplied by its weight, and the products are summed to obtain the final prediction. We evaluate the model with 10-fold cross-validation, and the evaluation metrics improve significantly. Prediction accuracy on the two datasets reaches 95.41% and 93.50%, respectively, which demonstrates the competitiveness and generalization performance of our model. All datasets and source code are available at https://github.com/HongyanShi026/R5hmCFDV.
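
The weighted soft-voting step in (iii) can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that); it only shows the arithmetic the abstract describes, with each base classifier's predicted probability multiplied by a weight and the products summed. The function names `soft_vote` and `predict_labels`, the example probabilities, the weights and the 0.5 decision threshold are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def soft_vote(probabilities: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[float]:
    """Fuse per-classifier positive-class probabilities by a weighted sum.

    probabilities[k][i] is classifier k's predicted probability for sample i.
    Weights are normalized to sum to 1 so the result stays in [0, 1].
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(probabilities[0])
    if any(len(p) != n_samples for p in probabilities):
        raise ValueError("all classifiers must score the same samples")
    normalized = [w / total for w in weights]
    return [
        sum(w * p[i] for w, p in zip(normalized, probabilities))
        for i in range(n_samples)
    ]


def predict_labels(fused: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn fused probabilities into 0/1 labels (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in fused]


if __name__ == "__main__":
    # Hypothetical outputs of the three base classifiers on three sequences.
    dnn = [0.9, 0.2, 0.6]
    cnn = [0.8, 0.4, 0.4]
    gru = [0.7, 0.1, 0.5]
    fused = soft_vote([dnn, cnn, gru], weights=[0.5, 0.3, 0.2])
    print(fused, predict_labels(fused))
```

How the weights are chosen is not stated in the abstract; in this sketch they are simply passed in by the caller.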

