Shim Hyunjin
Artificial Intelligence Laboratory, Stanford University, Stanford, CA, USA.
School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
Evol Bioinform Online. 2019 Jan 10;15:1176934318821072. doi: 10.1177/1176934318821072. eCollection 2019.
Recent studies reveal that even the smallest genomes, such as those of viruses, evolve through complex and stochastic processes, and that the assumption of independent alleles is not valid in most applications. Advances in sequencing technologies now produce whole-genome data sampled at multiple time points, which allow potential interactions between alleles to be investigated empirically. To investigate these interactions, we represent alleles as distributed vectors that encode their relationships with other alleles over the course of evolution, and we apply artificial neural networks to time-sampled whole-genome datasets for feature learning. We build this platform using methods and algorithms derived from natural language processing (NLP), and we call it the nucleotide skip-gram neural network. We learn distributed vectors of alleles from the changes in allele frequency of echovirus 11 in the presence or absence of the disinfectant (ClO₂), using experimental evolution data. Results from training with the open-source software library TensorFlow show that the learned distributed vectors can be clustered using principal component analysis and hierarchical clustering to reveal a list of non-synonymous mutations that arise on the structural protein VP1 in connection with the candidate mutation for ClO₂ adaptation. Furthermore, this method can account for recombination rates by setting the extent of interactions as a biological hyper-parameter, and the results show that the most realistic scenario, mid-range interactions across the genome, is the most consistent with previous studies.
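As a rough illustration of the skip-gram idea described in the abstract, the sketch below generates (target, context) training pairs from a list of alleles, with a window size standing in for the "extent of interactions" hyper-parameter. It is a minimal sketch under stated assumptions, not the author's implementation: the allele tokens, the function name `skipgram_pairs`, and the choice to measure the window in list positions rather than genomic distance are all hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(alleles: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for each allele and its neighbours
    within `window` list positions on either side.

    Assumption: `alleles` is ordered by genome position, and `window`
    plays the role of the interaction-extent hyper-parameter.
    """
    pairs = []
    for i, target in enumerate(alleles):
        lo = max(0, i - window)
        hi = min(len(alleles), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an allele is not its own context
                pairs.append((target, alleles[j]))
    return pairs


# Hypothetical allele tokens (reference base, position, variant base).
alleles = ["C120T", "A845G", "G2310A", "T2954C"]

# A small window pairs only adjacent alleles (short-range interactions).
pairs = skipgram_pairs(alleles, window=1)

# A window spanning the whole list pairs every allele with every other.
all_pairs = skipgram_pairs(alleles, window=len(alleles) - 1)
```

In a skip-gram model, pairs like these would be the training examples from which the embedding vectors are learned; a wider window lets more distant alleles influence each other's vectors.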