Genome annotation across species using deep convolutional neural networks.

Authors

Khodabandelou Ghazaleh, Routhier Etienne, Mozziconacci Julien

Affiliations

Laboratoire de Physique Théorique de la Matière Condensée (LPTMC), Sorbonne Université, Paris, France.

Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), Université Val-de-Marne (Paris XII), Paris, France.

Publication information

PeerJ Comput Sci. 2020 Jun 15;6:e278. doi: 10.7717/peerj-cs.278. eCollection 2020.

Abstract

The application of deep neural networks is a rapidly expanding field that now reaches many disciplines, including genomics. In particular, convolutional neural networks have been used to identify the functional role of short genomic sequences. These approaches rely on gathering large sets of sequences with a known functional role, extracted from whole-genome annotations. These sets are then split into learning, test and validation sets in order to train the networks. While the resulting networks perform well on validation sets, they often perform poorly when applied to whole genomes, in which the ratio of positive to negative examples can be very different from that in the training set. Here we address this issue by assessing the genome-wide performance of networks trained on sets with different ratios of positive to negative examples. As a case study, we use sequences encompassing gene starts from the RefGene database as positive examples and random genomic sequences as negative examples. We then demonstrate that models trained on data from one organism can be used to predict gene-start sites in a related species, provided the training sets give good genome-wide performance. This cross-species application of convolutional neural networks provides a new way to annotate any genome from existing high-quality annotations in a related reference species. It also provides a way to determine whether the sequence motifs recognised by chromatin-associated proteins in different species are conserved or not.
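
The abstract's point about class ratios can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes a hypothetical classifier with a fixed sensitivity and false-positive rate (the values 0.9 and 0.01 are invented for illustration) and shows how its precision changes as the number of negative examples per positive example grows, which is the situation a network faces when it moves from a balanced validation set to a whole genome.

```python
def precision_at_ratio(tpr: float, fpr: float, neg_per_pos: float) -> float:
    """Expected precision when there are `neg_per_pos` negatives per positive.

    tpr: true-positive rate (sensitivity) of the classifier.
    fpr: false-positive rate of the classifier.
    """
    true_pos = tpr                 # expected true positives per positive example
    false_pos = fpr * neg_per_pos  # expected false positives per positive example
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical classifier (assumed values, not results from the paper).
    tpr, fpr = 0.9, 0.01
    for ratio in (1, 100, 10_000):
        p = precision_at_ratio(tpr, fpr, ratio)
        print(f"{ratio:>6} negatives per positive: precision = {p:.4f}")
```

With these assumed rates, precision is close to 1 on a balanced set but falls below 0.5 once negatives outnumber positives 100 to 1, even though the classifier itself has not changed. This is one way to see why performance measured on a validation set may not carry over to genome-wide prediction.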

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0d8/7924482/a29d846f7983/peerj-cs-06-278-g001.jpg
