Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
Institute of Microbiology, Kiel University, Kiel 24118, Germany.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i21-i29. doi: 10.1093/bioinformatics/btaa503.
An important task in comparative genomics is to detect functional units by analyzing gene-context patterns. Colinear syntenic blocks (CSBs) are groups of genes that are consistently encoded in the same neighborhood and in the same order across a wide range of taxa. Such CSBs are likely essential for the regulation of gene expression in prokaryotes. Recent results indicate that colinearity can be conserved across multiple operons, thus motivating the discovery of multi-operon CSBs. This computational task raises scalability challenges in large datasets.
We propose an efficient algorithm for the discovery of cross-strand multi-operon CSBs in large genomic datasets. The proposed algorithm uses match-point arithmetic, which scales to large datasets of microbial genomes in both running time and space requirements. The algorithm is implemented and incorporated into a tool with a graphical user interface, called CSBFinder-S. We applied CSBFinder-S to mine 1485 prokaryotic genomes and analyzed the identified cross-strand CSBs. Our results indicate that most of the syntenic blocks are exclusively colinear. Additional results indicate that transcriptional regulation by overlapping transcriptional genes is abundant in bacteria. We demonstrate the utility of CSBFinder-S in identifying a common function of the gene pair PulEF in multiple contexts, including the Type 2 Secretion System, the Type 4 Pilus System and the DNA uptake machinery.
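To make the notion of a cross-strand CSB concrete, the following is a minimal sketch of a naive search, not the match-point algorithm used by CSBFinder-S. It assumes each genome is a list of hypothetical `(gene_family, strand)` pairs with strand `+1` or `-1`, and it reports every window of `k` consecutive genes that occurs, in the same order and strand pattern or as its reverse complement, in at least `quorum` genomes. The function names and the input layout are illustrative assumptions.

```python
from collections import defaultdict


def reverse_complement(block):
    """Reverse the gene order and flip every strand of a signed gene block."""
    return tuple((gene, -strand) for gene, strand in reversed(block))


def canonical(block):
    """Choose one representative for a block and its reverse complement,
    so the same block read from either DNA strand is counted once."""
    return min(block, reverse_complement(block))


def find_colinear_blocks(genomes, k, quorum):
    """Naive search (illustrative only): map each length-k gene block to the
    set of genomes containing it, keeping blocks found in >= quorum genomes."""
    support = defaultdict(set)
    for genome_id, genes in genomes.items():
        for i in range(len(genes) - k + 1):
            window = tuple(genes[i:i + k])
            support[canonical(window)].add(genome_id)
    return {block: ids for block, ids in support.items() if len(ids) >= quorum}


# Toy input: genes A and B on the forward strand followed by C on the
# reverse strand, i.e. a block that crosses strands.
genomes = {
    "g1": [("A", 1), ("B", 1), ("C", -1), ("D", 1)],
    "g2": [("X", 1), ("A", 1), ("B", 1), ("C", -1)],
    "g3": [("C", 1), ("B", -1), ("A", -1), ("Y", 1)],  # same block, opposite orientation
}
blocks = find_colinear_blocks(genomes, k=3, quorum=3)
```

This sketch enumerates every window in every genome, so it only illustrates the definition; it does not allow insertions within a block and makes no attempt at the scalability that motivates the algorithm described above.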
CSBFinder-S software and code are publicly available at https://github.com/dinasv/CSBFinder.
Supplementary data are available at Bioinformatics online.