
A deep learning framework combined with word embedding to identify DNA replication origins.

Affiliation

School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China.

Publication information

Sci Rep. 2021 Jan 12;11(1):844. doi: 10.1038/s41598-020-80670-x.

Abstract

DNA replication influences the inheritance of genetic information in the DNA life cycle. Because the distribution of replication origins (ORIs) is the major determinant of precise regulation of the replication process, correctly identifying ORIs is important for understanding DNA replication mechanisms and the regulatory mechanisms of gene expression. In eukaryotes in particular, each gene sequence contains multiple ORIs so that replication can complete in a reasonable period of time. To simplify the identification of eukaryotic ORIs, most existing methods rely on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are unsatisfactory and leave considerable room for improvement. To overcome these limitations, this paper develops sequence segmentation methods and employs the word embedding technique 'Word2vec' to convert gene sequences into word vectors, thereby capturing the inner correlations of gene sequences of different lengths. A deep learning framework for ORI identification is then constructed from a convolutional neural network with an embedding layer. An analysis of the dimensionality-reduced similarity diagram shows that Word2vec can effectively transform the inner relationships among words into numerical features. For the four species in this study, the best models achieve overall accuracies of 0.975, 0.765, 0.885, and 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771, and 0.934, and AUCs of 0.975, 0.800, 0.888, and 0.981, respectively, which indicates that the proposed predictor is stable and classifies both ORIs and non-ORIs with high confidence. Compared with state-of-the-art methods, the proposed predictor achieves a significant improvement in ORI identification. It is therefore reasonable to anticipate that the proposed method will be a useful high-throughput tool for genome analysis.
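
The segmentation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which segmentation schemes or word lengths the authors used, so the k-mer approach, the parameters `k` and `stride`, and the helper names `segment`, `build_vocab`, and `encode` below are all assumptions for illustration only. The sketch shows how sequences of different lengths could be turned into "words" and then into padded integer IDs of the kind an embedding layer typically consumes; it does not train Word2vec or a CNN.

```python
from typing import Dict, Iterable, List


def segment(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mer 'words' (hypothetical scheme).

    stride=1 yields overlapping words; stride=k yields non-overlapping words.
    Sequences shorter than k yield an empty list.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct word an integer ID, starting at 1 (0 is padding)."""
    vocab: Dict[str, int] = {}
    for words in sentences:
        for word in words:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(words: List[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Map words to IDs, then truncate or zero-pad to a fixed length.

    Unknown words also map to 0 here, which is a simplification.
    """
    ids = [vocab.get(word, 0) for word in words][:length]
    return ids + [0] * (length - len(ids))


if __name__ == "__main__":
    # Two toy sequences of different lengths (not real ORI data).
    sentences = [segment("ATGCGT"), segment("ATGA")]
    vocab = build_vocab(sentences)
    for words in sentences:
        print(words, encode(words, vocab, length=4))
```

Padding to a common length is one simple way to let a fixed-shape network accept variable-length inputs; whether the paper handles length this way is not stated in the abstract.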


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9bd1/7804333/55b109d7c609/41598_2020_80670_Fig1_HTML.jpg
