Rafi Abdul Muntakim, Nogina Daria, Penzar Dmitry, Lee Dohoon, Lee Danyeong, Kim Nayeon, Kim Sangyeup, Kim Dohyeon, Shin Yeojin, Kwak Il-Youp, Meshcheryakov Georgy, Lando Andrey, Zinkevich Arsenii, Kim Byeong-Chan, Lee Juhyun, Kang Taein, Vaishnav Eeshit Dhaval, Yadollahpour Payman, Kim Sun, Albrecht Jake, Regev Aviv, Gong Wuming, Kulakovskiy Ivan V, Meyer Pablo, de Boer Carl G
University of British Columbia, Vancouver, British Columbia, Canada.
Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia.
Nat Biotechnol. 2024 Oct 11. doi: 10.1038/s41587-024-02414-w.
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations of these building blocks for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
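The abstract describes the Prix Fixe framework only at a high level. As a minimal illustrative sketch (not the challenge code), the Python snippet below shows the general idea of treating a model as interchangeable building blocks and exhaustively evaluating every combination; all block names and the scoring stub are hypothetical placeholders.

    # Hypothetical sketch of the Prix Fixe idea: interchangeable building
    # blocks are combined exhaustively and each assembled model is scored.
    # Block names and evaluate() are illustrative placeholders only.
    from itertools import product

    # Candidate implementations for each slot (e.g., contributed by different teams).
    first_layers = {"conv_stem_A": lambda x: x, "conv_stem_B": lambda x: x}
    core_blocks  = {"resnet_core": lambda x: x, "attention_core": lambda x: x}
    final_layers = {"pooling_head": lambda x: x, "mlp_head": lambda x: x}

    def evaluate(pipeline):
        """Placeholder for training the assembled model and scoring it
        (e.g., correlation between predicted and measured expression)."""
        return 0.0  # stand-in score

    results = {}
    for (f_name, f), (c_name, c), (h_name, h) in product(
            first_layers.items(), core_blocks.items(), final_layers.items()):
        # Compose the blocks into a single callable "model".
        pipeline = lambda x, f=f, c=c, h=h: h(c(f(x)))
        results[(f_name, c_name, h_name)] = evaluate(pipeline)

    best = max(results, key=results.get)
    print("Best combination:", best)

In this sketch, each slot (first layers, core, final layers) can be swapped independently, so combinations drawn from different top-ranked models can be compared under identical training conditions.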