Pendulum Therapeutics, Inc., San Francisco, CA 94107, USA.
European Bioinformatics Institute, Cambridge CB10 1SD, UK.
Bioinformatics. 2022 Jun 24;38(Suppl 1):i36-i44. doi: 10.1093/bioinformatics/btac238.
Genome-wide association studies (GWAS), which aim to find genetic variants associated with a trait, have been widely used in bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single-nucleotide polymorphisms to mobile genetic elements. This approach does not require a reference genome, making it easier to account for accessory genes. However, the same gene can exist in slightly different versions across strains, leading to diluted effects.
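As a minimal illustration of the k-mer representation described above, the following Python sketch builds a presence/absence table over two toy genomes that differ by a single nucleotide. The sequences, the value of k, and the helper names `kmers` and `presence_matrix` are assumptions made for this example and are not taken from the authors' implementation.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_matrix(genomes, k):
    """Map each k-mer to a tuple of 0/1 presence flags, one per genome."""
    per_genome = [kmers(g, k) for g in genomes]
    all_kmers = sorted(set().union(*per_genome))
    return {km: tuple(int(km in s) for s in per_genome) for km in all_kmers}


# Hypothetical toy genomes: the second differs from the first by one SNP.
genomes = ["ACGTACGGA", "ACGTTCGGA"]
matrix = presence_matrix(genomes, k=3)

# k-mers present in only one of the two genomes overlap the variant site.
variant_kmers = {km for km, flags in matrix.items() if sum(flags) == 1}
```

In this toy case the two versions of the same region are tagged by disjoint sets of k-mers, so no single k-mer is present in both genomes at the variant site. This is one way to picture how a signal carried by a polymorphic gene can be spread over several k-mers.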
Here, we overcome this issue by testing covariates built from closed connected subgraphs (CCSs) of the de Bruijn graph defined over genomic k-mers. These covariates capture polymorphic genes as a single entity, improving both the power and the interpretability of k-mer-based GWAS. However, a method that naively tested all possible subgraphs would be powerless because of multiple testing corrections, and merely exploring these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has been used successfully to address both problems in similar contexts. We leverage this concept to test all CCSs by proposing a novel enumeration scheme for these objects that fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. Our method integrates with existing visual tools to facilitate interpretation.
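The notion of testability can be sketched as follows. For a discrete test on a 2x2 contingency table, the margins alone bound how small the p-value can be, and a covariate whose smallest attainable p-value exceeds the corrected threshold can never be significant, so it need not count toward the multiple testing correction. The sketch below assumes a two-sided Fisher exact test and Tarone's procedure purely for illustration; the abstract does not state which test the method uses, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def min_attainable_pvalue(n, n_cases, x):
    """Smallest two-sided Fisher p-value over all 2x2 tables with fixed margins.

    n: number of samples, n_cases: number of cases,
    x: number of samples in which the covariate is present.
    """
    lo, hi = max(0, x - (n - n_cases)), min(x, n_cases)
    total = comb(n, x)
    # Hypergeometric probability of each table compatible with the margins.
    pmf = [Fraction(comb(n_cases, a) * comb(n - n_cases, x - a), total)
           for a in range(lo, hi + 1)]
    smallest = min(pmf)
    # The p-value of the least likely table sums all equally unlikely tables.
    return sum(p for p in pmf if p == smallest)


def tarone_correction_factor(min_pvalues, alpha=0.05):
    """Smallest k such that at most k hypotheses can reach p <= alpha / k."""
    k = 1
    while sum(p <= alpha / k for p in min_pvalues) > k:
        k += 1
    return k


# Hypothetical cohort: 20 samples, 10 cases.
rare = min_attainable_pvalue(20, 10, 1)       # covariate present in 1 sample
balanced = min_attainable_pvalue(20, 10, 10)  # covariate present in 10 samples
k = tarone_correction_factor([rare, rare, rare, balanced])
```

With these toy margins, a covariate present in a single sample has a smallest attainable p-value of 1 and is untestable, while one present in half of the samples can reach 1/92378. Only the latter counts toward the correction, so the threshold stays at alpha / 1 rather than alpha / 4.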
We provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_ISMB.
Supplementary data are available at Bioinformatics online.