Duncan Andrew G, Mitchell Jennifer A, Moses Alan M
Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae190.
Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the mapping from sequence to regulatory function (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited.
Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics.
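To make the idea concrete, here is a minimal sketch of how phylogenetic augmentation could be applied when building training batches: each training sequence may be swapped for a randomly chosen homolog from another species while keeping the functional label measured for the original. This is an illustration under stated assumptions, not the authors' implementation. The names `phylo_augment`, `training_batches`, `homolog_map`, and `p_replace` are hypothetical, and the sketch assumes that homologous sequences have already been collected per training example and that a homolog inherits the label of its original.

```python
import random

# Hypothetical one-hot table; unknown bases (e.g. N) map to all zeros.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def one_hot(seq):
    """Encode a DNA string as a list of 4-tuples."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


def phylo_augment(sequence, homologs, rng, p_replace=1.0):
    """Return a random homolog with probability p_replace, else the original.

    Assumption: the homolog is treated as having the same label as the
    original sequence. If no homologs exist, the original is returned.
    """
    if homologs and rng.random() < p_replace:
        return rng.choice(homologs)
    return sequence


def training_batches(examples, homolog_map, batch_size, rng, p_replace=1.0):
    """Yield (inputs, labels) batches with augmentation applied per example.

    examples:    list of (seq_id, sequence, label) tuples
    homolog_map: dict mapping seq_id -> list of homologous sequences
    """
    order = list(examples)
    rng.shuffle(order)  # new order each epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        inputs = [
            one_hot(phylo_augment(seq, homolog_map.get(seq_id, []), rng, p_replace))
            for seq_id, seq, _ in batch
        ]
        labels = [label for _, _, label in batch]
        yield inputs, labels
```

Because a homolog is drawn afresh each time a batch is built, the model can see a different evolutionarily related variant of the same regulatory element in each epoch. Validation and test batches would presumably use only the original sequences, so that reported performance reflects the species in which the assay was run.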
The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper contains the code needed to rerun the analyses reported here and to recreate the figures.