Vorontsov Ilya E, Kozin Ivan, Abramov Sergey, Boytsov Alexandr, Jolma Arttu, Albu Mihai, Ambrosini Giovanna, Faltejskova Katerina, Gralak Antoni J, Gryzunov Nikita, Inukai Sachi, Kolmykov Semyon, Kravchenko Pavel, Kribelbauer-Swietek Judith F, Laverty Kaitlin U, Nozdrin Vladimir, Patel Zain M, Penzar Dmitry, Plescher Marie-Luise, Pour Sara E, Razavi Rozita, Yang Ally W H, Yevshin Ivan, Zinkevich Arsenii, Weirauch Matthew T, Bucher Philipp, Deplancke Bart, Fornes Oriol, Grau Jan, Grosse Ivo, Kolpakov Fedor A, Makeev Vsevolod J, Hughes Timothy R, Kulakovskiy Ivan V
Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.
Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia.
bioRxiv. 2024 Nov 13:2024.11.11.619379. doi: 10.1101/2024.11.11.619379.
A DNA sequence pattern, or "motif", is an essential representation of DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications. We applied ten different DNA motif discovery tools to generate PWMs from the "Codebook" data, comprising 4,237 experiments from five different platforms profiling the DNA-binding specificity of 394 human proteins, focusing on understudied transcription factors of different structural families. For many of the proteins, there was no prior knowledge of a genuine motif. Through benchmarking-supported human curation, we constructed an approved subset of experiments comprising about 30% of all experiments and 50% of tested TFs, which displayed consistent motifs across platforms and replicates. We present the Codebook Motif Explorer (https://mex.autosome.org), a detailed online catalog of DNA motifs, including the top-ranked PWMs and the underlying source and benchmarking data. We demonstrate that in the case of high-quality experimental data, most of the popular motif discovery tools detect valid motifs and generate PWMs that perform well on both genomic and synthetic data. Yet, for each of the algorithms, there were problematic combinations of proteins and platforms, and basic motif properties such as nucleotide composition and information content offered little help in detecting such pitfalls. By combining multiple PWMs in decision trees, we demonstrate how our setup can be readily adapted to train and test binding specificity models more complex than PWMs.
Overall, our study provides a rich motif catalog as a solid baseline for advanced models and highlights the power of the multi-platform multi-tool approach for reliable mapping of DNA binding specificities.
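For readers unfamiliar with PWMs, the following is a minimal Python sketch of how a log-odds PWM can be derived from a position count matrix, how its information content can be computed, and how a sequence can be scanned for the best-scoring site. It is an illustration only and not the Codebook/GRECO-BIT pipeline: the function names, the pseudocount, the uniform background, and the toy counts are all assumptions made for this example.

```python
import math

BASES = "ACGT"
UNIFORM_BACKGROUND = 0.25  # assumption: equal background frequency for each base


def position_probabilities(counts, pseudocount=1.0):
    """Turn one position's base counts into smoothed probabilities."""
    total = sum(counts[b] for b in BASES) + pseudocount * len(BASES)
    return {b: (counts[b] + pseudocount) / total for b in BASES}


def counts_to_pwm(count_matrix, pseudocount=1.0):
    """Build a log2-odds PWM (one dict per position) from base counts."""
    pwm = []
    for counts in count_matrix:
        probs = position_probabilities(counts, pseudocount)
        pwm.append({b: math.log2(probs[b] / UNIFORM_BACKGROUND) for b in BASES})
    return pwm


def information_content(count_matrix, pseudocount=1.0):
    """Total information content in bits relative to the uniform background."""
    bits = 0.0
    for counts in count_matrix:
        probs = position_probabilities(counts, pseudocount)
        bits += sum(p * math.log2(p / UNIFORM_BACKGROUND) for p in probs.values())
    return bits


def best_hit(pwm, sequence):
    """Return (score, offset) of the best forward-strand window.

    The first window wins ties. The reverse strand is not scanned here.
    """
    width = len(pwm)
    best_score, best_offset = float("-inf"), -1
    for offset in range(len(sequence) - width + 1):
        score = sum(pwm[j][sequence[offset + j]] for j in range(width))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


# Toy counts, made up for illustration, with consensus TGCA.
toy_counts = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 0, "C": 2, "G": 18, "T": 0},
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 18, "C": 0, "G": 1, "T": 1},
]
toy_pwm = counts_to_pwm(toy_counts)
toy_score, toy_offset = best_hit(toy_pwm, "AATGCATT")
toy_bits = information_content(toy_counts)
```

In this toy case the best window starts at offset 2, where the consensus TGCA occurs. A benchmarking setup like the one described above would compare such scores between bound and unbound sequences across platforms; that step is not shown here.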