Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, SP 13566-590, Brazil.
Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg, Germany.
Bioinformatics. 2021 Jun 16;37(10):1352-1359. doi: 10.1093/bioinformatics/btaa984.
CRISPR-Cas systems are found in most archaeal and many bacterial genomes, where they provide prokaryotes with adaptive immunity against mobile genetic elements. Each CRISPR-Cas system is encoded by a set of consecutive cas genes, here termed a cassette. Identifying cassette boundaries is key to finding cassettes in CRISPR research, and it is often carried out using Hidden Markov Models and manual annotation. In this article, we propose the first method able to automatically define cassette boundaries. In addition, we present a Cas-type predictive model that the method uses to assign a Cas label, drawn from a set of predefined Cas types, to each gene located within a cassette's boundaries. Furthermore, the proposed method can detect potentially new cas genes and decompose a cassette into its modules.
We evaluate the predictive performance of the proposed method on data collected from the two most recent CRISPR classification studies. In our experiments, we obtain an average similarity of 0.86 between the predicted and expected cassettes. In addition, we achieve F-scores above 0.9 for the classification of cas genes of known types and 0.73 for genes of unknown types. Finally, we conduct two additional case studies, in which we investigate the occurrence of potentially new cas genes and the occurrence of module exchange between different genomes.
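The evaluation above reports two kinds of numbers: a similarity between predicted and expected cassettes, and F-scores for gene classification. The abstract does not say which similarity measure is used, so the sketch below assumes a Jaccard index over gene identifiers purely for illustration; the function names, gene identifiers, and labels are hypothetical and are not taken from Casboundary.

```python
def jaccard_similarity(predicted, expected):
    """Jaccard index between two cassettes given as collections of gene IDs.

    ASSUMPTION: the abstract does not name its similarity measure;
    Jaccard is used here only as a plausible stand-in.
    """
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0  # two empty cassettes are treated as identical
    return len(predicted & expected) / len(predicted | expected)


def f_score(true_labels, predicted_labels, positive):
    """F1 score for one label, treating `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gene identifiers and labels, for illustration only.
similarity = jaccard_similarity(
    ["gene_1", "gene_2", "gene_3"],
    ["gene_2", "gene_3", "gene_4"],
)
score = f_score(
    ["cas1", "cas2", "cas1", "cas1"],
    ["cas1", "cas1", "cas1", "cas2"],
    positive="cas1",
)
```

Averaging `jaccard_similarity` over all evaluated cassettes would give a single summary figure comparable in spirit to the reported average similarity, although the published value may be computed differently.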
https://github.com/BackofenLab/Casboundary.
Supplementary data are available at Bioinformatics online.