Feature Learning of Virus Genome Evolution With the Nucleotide Skip-Gram Neural Network.

Author information

Shim Hyunjin

Affiliations

Artificial Intelligence Laboratory, Stanford University, Stanford, CA, USA.

School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.

Publication information

Evol Bioinform Online. 2019 Jan 10;15:1176934318821072. doi: 10.1177/1176934318821072. eCollection 2019.

Abstract

Recent studies reveal that even the smallest genomes, such as those of viruses, evolve through complex and stochastic processes, and the assumption of independent alleles is not valid in most applications. Advances in sequencing technologies produce multiple time-point whole-genome data, which enable potential interactions between these alleles to be investigated empirically. To investigate these interactions, we represent alleles as distributed vectors that encode relationships with other alleles in the course of evolution and apply artificial neural networks to time-sampled whole-genome datasets for feature learning. We build this platform using methods and algorithms derived from natural language processing (NLP), and we denote it as the nucleotide skip-gram neural network. We learn distributed vectors of alleles using the changes in allele frequency of echovirus 11 in the presence or absence of the disinfectant (ClO2) from the experimental evolution data. Results from the training using the open-source software library TensorFlow show that the learned distributed vectors can be clustered using principal component analysis and hierarchical clustering to reveal a list of non-synonymous mutations that arise on the structural protein VP1 in connection with the candidate mutation for ClO2 adaptation. Furthermore, this method can account for recombination rates by setting the extent of interactions as a biological hyper-parameter, and the results show that the most realistic scenario of mid-range interactions across the genome is most consistent with previous studies.
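As a rough illustration of how the extent of interactions can act as a hyper-parameter in a skip-gram setup, the sketch below generates (target, context) training pairs from an ordered list of alleles. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `skipgram_pairs`, the `window` parameter, and the allele tokens are all hypothetical, and the abstract does not say how alleles are tokenised or ordered.

```python
from typing import List, Tuple


def skipgram_pairs(alleles: List[str], window: int) -> List[Tuple[str, str]]:
    """Pair each allele with every other allele at most `window` list positions away."""
    pairs = []
    for i, target in enumerate(alleles):
        lo = max(0, i - window)
        hi = min(len(alleles), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an allele is never its own context
                pairs.append((target, alleles[j]))
    return pairs


# Hypothetical allele tokens ordered by genome position (not from the paper's data).
alleles = ["A101G", "C250T", "G377A", "T512C"]

# Assumption: a larger window stands in for a longer range of interactions.
short_range = skipgram_pairs(alleles, window=1)
mid_range = skipgram_pairs(alleles, window=2)
```

In a full pipeline, pairs like these would feed an embedding model, and the learned vectors could then be inspected with principal component analysis and hierarchical clustering as the abstract describes.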

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3d2b/6335656/0719c38d7f81/10.1177_1176934318821072-fig1.jpg
