Department of Biology, University of Copenhagen, Copenhagen, Denmark.
Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Science, University of Copenhagen, Copenhagen, Denmark.
PeerJ. 2022 Sep 20;10:e13666. doi: 10.7717/peerj.13666. eCollection 2022.
One way to better understand the structure in DNA is to learn to predict the sequence. Here, we trained a model to predict the missing base at any given position, given its left and right flanking contexts. Our best-performing model was a neural network that obtained an accuracy close to 54% on the human genome, which is 2 percentage points better than modelling the data with a Markov model. In likelihood-ratio tests, the neural network performed significantly better than any of the alternative models, by a large margin. We report on where the accuracy was obtained, first observing that the performance appeared to be uniform over the chromosomes. As expected, the models performed best in repetitive sequences, although their performance was far from random in the more difficult coding sections, with accuracies of roughly 70% and 40%, respectively. We further explored the sources of the accuracy: Fourier transforming the predictions revealed weak but clear periodic signals. In the human genome, the characteristic periods hinted at connections to nucleosome positioning. We found similar periodic signals in GC/AT content in the human genome, which to the best of our knowledge have not been reported before. Similarly high accuracy was found on other large genomes, while lower predictive accuracy was observed on smaller genomes. Only in the mouse genome did we see periodic signals in the same range as in the human genome, though weaker and of a different type. This indicates that these signals arise from something other than, or in addition to, nucleosome arrangement. Interestingly, applying a model trained on the mouse genome to the human genome resulted in a performance far below that of the human model, except in the difficult coding regions. Despite the clear outcomes of the likelihood-ratio tests, the neural network methods currently show only limited superiority over the Markov model. We expect, however, that there is great potential for better modelling of DNA using different neural network architectures.
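The prediction task described above (guess the base at a position from its left and right flanks) can be illustrated with a small sketch. This is a plain count-based lookup over flanking contexts, used here as a simple stand-in; it is not the neural network or the Markov model evaluated in the paper. The function names, the flank length `k`, and the fallback base are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_central_model(seq, k):
    """Count which base occurs between each (left, right) pair of k-base flanks."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq) - k):
        left = seq[i - k:i]
        right = seq[i + 1:i + 1 + k]
        counts[(left, right)][seq[i]] += 1
    return counts


def predict_base(counts, left, right, fallback="A"):
    """Return the most frequent central base for this context.

    The fallback for unseen contexts is an arbitrary choice (assumption).
    """
    ctx = counts.get((left, right))
    if not ctx:
        return fallback
    return ctx.most_common(1)[0][0]


def accuracy(counts, seq, k):
    """Fraction of positions in seq whose central base is predicted correctly."""
    hits = total = 0
    for i in range(k, len(seq) - k):
        pred = predict_base(counts, seq[i - k:i], seq[i + 1:i + 1 + k])
        hits += pred == seq[i]
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    toy = "ACGT" * 50  # toy sequence, not genomic data
    model = train_central_model(toy, k=2)
    print(predict_base(model, "AC", "TA"))
    print(accuracy(model, toy, k=2))
```

On real genomes one would score held-out sequence rather than the training sequence, since the number of distinct contexts grows as 4^(2k) and a lookup table memorises its training data.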
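The search for periodic signals in GC/AT content mentioned in the abstract can be sketched as a power spectrum of a per-position indicator signal. This is a minimal illustration that assumes NumPy is available and uses a synthetic sequence; the function names are hypothetical, and the paper's actual procedure (windowing, normalisation, and applying the transform to model predictions) is not specified here.

```python
import numpy as np


def gc_indicator(seq):
    """Map a DNA string to 1.0 for G/C and 0.0 for any other character."""
    return np.array([1.0 if b in "GC" else 0.0 for b in seq.upper()])


def power_spectrum(signal):
    """Return (frequencies in cycles per base, power) for a real-valued signal."""
    centered = signal - signal.mean()  # remove the zero-frequency offset
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0)
    return freqs, power


def dominant_period(signal):
    """Period (in bases) of the strongest non-zero frequency component."""
    freqs, power = power_spectrum(signal)
    k = int(np.argmax(power[1:])) + 1  # skip the zero-frequency bin
    return 1.0 / freqs[k]


if __name__ == "__main__":
    toy = "GGGGGAAAAA" * 40  # synthetic sequence with a 10-base GC/AT period
    print(dominant_period(gc_indicator(toy)))
```

The same `power_spectrum` helper could be applied to any per-position signal, for example an indicator of whether a model predicted each base correctly.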