Išerić Hamza, Alkan Can, Hach Faraz, Numanagić Ibrahim
Department of Computer Science, University of Victoria, Victoria, BC, V8P 5C2, Canada.
Department of Computer Engineering, Bilkent University, 06800, Ankara, Turkey.
Algorithms Mol Biol. 2022 Mar 18;17(1):4. doi: 10.1186/s13015-022-00210-2.
The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity, which stimulates further evolution by changing the genomic structure and giving rise to new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms, which are impractical given the size of most genomes. Additionally, evolutionary analysis requires characterizing SDs in multiple genomes and finding relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach based on multiple sequence alignment would make an optimal solution to this problem even more impractical. Thus, there is a need for fast and accurate algorithms that characterize SD structure in multiple genome assemblies, to better understand the evolutionary forces that shaped the genomes of today.
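To make the quadratic cost concrete, the following is a minimal sketch of a textbook Smith-Waterman style local alignment score, not BISER's own method. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions. Every pair of positions is visited once, so the running time grows with the product of the two sequence lengths; for sequences of roughly 3 × 10^9 bases each, that would be on the order of 10^19 table cells.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    Textbook dynamic programming with a linear gap penalty.
    Scoring values are illustrative assumptions, not BISER's.
    Time is O(len(a) * len(b)); memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two toy sequences that share the segment "ACGT".
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The nested loops are the point of the sketch: doubling both sequence lengths quadruples the work, which is why exact local alignment does not scale to whole genomes.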
Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and to identify the elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7-33× speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications as far back as 300 million years.
BISER is implemented in the Seq programming language and is publicly available at https://github.com/0xTCG/biser.