Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America.
Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, United States of America.
PLoS Comput Biol. 2018 Oct 4;14(10):e1006484. doi: 10.1371/journal.pcbi.1006484. eCollection 2018 Oct.
Genomic regions with gene regulatory enhancer activity turn over rapidly across mammals. In contrast, gene expression patterns and transcription factor (TF) binding preferences are largely conserved between mammalian species. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. To investigate this hypothesis, we trained and tested two types of machine learning models to evaluate the extent to which sequence patterns that are predictive of enhancers in one species are also predictive of enhancers in other mammalian species. We trained support vector machine (SVM) and convolutional neural network (CNN) classifiers to distinguish enhancers defined by histone marks from the genomic background based on DNA sequence patterns in human, macaque, mouse, dog, cow, and opossum. The classifiers accurately identified many adult liver, developing limb, and developing brain enhancers, and the CNNs outperformed the SVMs. Furthermore, classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. We observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. This indicates that many short sequence patterns predictive of enhancers are largely conserved. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Our results suggest that short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution.
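The train-in-one-species, test-in-another design described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not specify the sequence features, so normalized k-mer frequencies are assumed here, and a simple mean-difference linear scorer stands in for the SVM and CNN classifiers. The two "species" are hypothetical toy data (random DNA with a planted motif), and all function names are illustrative.

```python
import random
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k=3):
    """Return normalized frequencies of all 4**k DNA k-mers in one sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {m: counts[m] / total for m in all_kmers}


def fit_weights(pos, neg, k=3):
    """Weight each k-mer by its mean frequency in positives minus negatives.

    Assumption: a naive linear scorer, used here in place of an SVM or CNN.
    """
    pos_specs = [kmer_spectrum(s, k) for s in pos]
    neg_specs = [kmer_spectrum(s, k) for s in neg]
    return {
        m: sum(sp[m] for sp in pos_specs) / len(pos_specs)
        - sum(sp[m] for sp in neg_specs) / len(neg_specs)
        for m in pos_specs[0]
    }


def score(seq, weights, k=3):
    """Linear score: dot product of k-mer frequencies and learned weights."""
    return sum(weights[m] * f for m, f in kmer_spectrum(seq, k).items())


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def toy_species(rng, n=50, length=200, motif="TGACTCA"):
    """Hypothetical data: 'enhancers' are random DNA with a planted motif."""
    def rand_seq():
        return "".join(rng.choice("ACGT") for _ in range(length))

    def plant(s):
        for _ in range(3):
            i = rng.randrange(length - len(motif))
            s = s[:i] + motif + s[i + len(motif):]
        return s

    pos = [plant(rand_seq()) for _ in range(n)]
    neg = [rand_seq() for _ in range(n)]
    return pos, neg


rng = random.Random(0)
train_pos, train_neg = toy_species(rng)   # stands in for species A
test_pos, test_neg = toy_species(rng)     # stands in for species B

weights = fit_weights(train_pos, train_neg)
cross_auc = auroc(
    [score(s, weights) for s in test_pos],
    [score(s, weights) for s in test_neg],
)
print(f"cross-species AUROC: {cross_auc:.2f}")
```

Because both toy species share the same planted motif, weights learned on one transfer to the other, which mirrors the abstract's logic at a very small scale: cross-species prediction works to the extent that the predictive sequence patterns are shared.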