Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia.
Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA.
Microbiome. 2021 Jun 28;9(1):149. doi: 10.1186/s40168-021-01092-z.
Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially on large plasmids in strains of the ubiquitous soil bacterium Bacillus thuringiensis (Bt). Because such plasmids often encode multiple similar IPGs, their assemblies are typically fragmented and many IPGs are scattered across multiple contigs. As a result, existing gene prediction tools, which analyze individual contigs, typically predict partial rather than complete IPGs, making downstream IPG engineering efforts in agricultural genomics difficult.
Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG.
We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of a greater diversity of IPGs from (meta)genomes.
We demonstrated that analysis of assembly graphs reveals novel candidate IPGs. ORFograph identified both known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing IPG identification tools. Because ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify, for phenotypic testing, many more novel IPGs that were previously inaccessible to traditional gene-finding methods. Although we demonstrated ORFograph only on IPGs here, the proposed approach can be generalized to any class of genes.
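To illustrate why the assembly graph can recover a complete gene that no single contig contains, the following is a minimal toy sketch, not ORFograph's actual algorithm. It assumes a hypothetical four-contig graph with no overlap between adjacent contigs, searches the forward strand only, and uses illustrative names (`longest_orf`, `simple_paths`, `best_orf_over_paths`) that do not come from the paper.

```python
from typing import Dict, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Longest ATG..stop ORF on the forward strand (stop codon included)."""
    best = ""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orf = seq[start:pos + 3]
                if len(orf) > len(best):
                    best = orf
                break
    return best


def simple_paths(nodes: List[str], edges: Dict[str, List[str]],
                 max_len: int) -> List[List[str]]:
    """Enumerate all simple paths of at most max_len contigs by DFS."""
    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        paths.append(path)
        if len(path) == max_len:
            return
        for nxt in edges.get(path[-1], []):
            if nxt not in path:
                extend(path + [nxt])

    for node in nodes:
        extend([node])
    return paths


def best_orf_over_paths(contigs: Dict[str, str], edges: Dict[str, List[str]],
                        max_len: int = 3) -> Tuple[List[str], str]:
    """Return the path whose concatenated sequence holds the longest ORF."""
    best_path: List[str] = []
    best = ""
    for path in simple_paths(list(contigs), edges, max_len):
        # Assumption: adjacent contigs do not overlap, so plain
        # concatenation spells the path sequence.
        orf = longest_orf("".join(contigs[c] for c in path))
        if len(orf) > len(best):
            best_path, best = path, orf
    return best_path, best


# Hypothetical toy graph: the gene is split across A, B and C,
# while D is an alternative branch carrying an early stop codon.
contigs = {"A": "CCATGAAAC", "B": "CCGGGTT", "C": "TTAAGG", "D": "CCTAGTT"}
edges = {"A": ["B", "D"], "B": ["C"], "D": ["C"]}

path, orf = best_orf_over_paths(contigs, edges)
print(path, orf)
```

In this toy case no individual contig contains a complete ORF, whereas the path A, B, C spells a full one. A real assembly graph would also require handling k-mer overlaps between adjacent edges, both strands, and a bounded search, since the number of paths can grow exponentially.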