Genome annotation across species using deep convolutional neural networks.

Authors

Khodabandelou Ghazaleh, Routhier Etienne, Mozziconacci Julien

Affiliations

Laboratoire de Physique Théorique de la Matière Condensée (LPTMC), Sorbonne Université, Paris, France.

Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), Université Val-de-Marne (Paris XII), Paris, France.

Publication information

PeerJ Comput Sci. 2020 Jun 15;6:e278. doi: 10.7717/peerj-cs.278. eCollection 2020.

DOI:10.7717/peerj-cs.278

PMID:33816929

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7924482/

Abstract

The application of deep neural networks is a rapidly expanding field now reaching many disciplines, including genomics. In particular, convolutional neural networks have been exploited for identifying the functional role of short genomic sequences. These approaches rely on gathering large sets of sequences with a known functional role, extracted from whole-genome annotations. These sets are then split into learning, test and validation sets in order to train the networks. While the resulting networks perform well on validation sets, they often perform poorly when applied to whole genomes, in which the ratio of positive to negative examples can be very different from that in the training set. Here we address this issue by assessing the genome-wide performance of networks trained with sets exhibiting different ratios of positive to negative examples. As a case study, we use sequences encompassing gene starts from the RefGene database as positive examples and random genomic sequences as negative examples. We then demonstrate that models trained using data from one organism can be used to predict gene-start sites in a related species, provided that the training sets used give good genome-wide performance. This cross-species application of convolutional neural networks provides a new way to annotate any genome from existing high-quality annotations in a related reference species. It also provides a way to determine whether the sequence motifs recognised by chromatin-associated proteins in different species are conserved or not.
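The gap between validation-set and genome-wide performance described above follows from simple arithmetic: for a classifier with fixed true-positive and false-positive rates, precision depends on the fraction of positives in the data it is applied to. The sketch below illustrates this. The function name and the rates (0.90 and 0.05) are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Expected precision for fixed TPR/FPR at a given fraction of positives."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    true_pos = tpr * prevalence            # expected fraction of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of false positives
    if true_pos + false_pos == 0.0:
        return 0.0
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical classifier rates (assumptions, not results from the paper).
    TPR, FPR = 0.90, 0.05
    # Number of negative examples per positive example in the evaluated data.
    for neg_per_pos in (1, 10, 100, 1000):
        prevalence = 1.0 / (1.0 + neg_per_pos)
        p = precision_at_prevalence(TPR, FPR, prevalence)
        print(f"1:{neg_per_pos:<5d} precision = {p:.3f}")
```

With these illustrative rates, precision is close to 0.95 on a balanced set but falls below 0.02 when negatives outnumber positives 1000 to 1, which is the kind of imbalance a whole-genome scan can produce.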
