Zhang Quan, Ye Yuzhen
School of Informatics and Computing, Indiana University, 150 S. Woodlawn Ave, Bloomington, IN, 47405, USA.
BMC Bioinformatics. 2017 Feb 6;18(1):92. doi: 10.1186/s12859-017-1512-4.
The CRISPR-Cas systems in prokaryotes are RNA-guided immune systems that target and deactivate foreign nucleic acids. A typical CRISPR-Cas system consists of a CRISPR array of repeat and spacer units, and a locus of cas genes. The CRISPR array and the cas locus are often located next to each other in the genome. However, there has been no quantitative estimate of how often they are co-located. In addition, ad hoc studies have shown that some non-CRISPR genomic elements contain repeat-spacer-like structures and are mistaken for CRISPRs.
Using available genome sequences, we observed that a significant number of genomes have isolated cas loci and/or CRISPRs. We found that 11%, 22% and 28% of the type I, II and III cas loci, respectively, are isolated (either the genome contains no CRISPRs at all, or its CRISPRs lie far from the cas locus). We identified a large number of genomic elements that superficially resemble CRISPRs but do not contain diverse spacers and have no companion cas genes. We called these elements false-CRISPRs and further classified them into groups, including tandem repeats and Staphylococcus aureus repeat (STAR)-like elements.
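The two criteria described above (a cas locus with no nearby CRISPR, and a repeat-spacer array whose spacers are not diverse) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Locus` record, the 10 kb `max_gap` threshold, and the use of the distinct-spacer fraction as a diversity measure are all assumptions chosen for illustration, and circular genomes are ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Locus:
    """Hypothetical record for a cas locus or a CRISPR array."""
    genome: str
    start: int
    end: int


def gap(a: Locus, b: Locus) -> int:
    """Distance in bp between two loci on a linear coordinate (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_isolated(cas: Locus, crisprs: List[Locus], max_gap: int = 10_000) -> bool:
    """True when no CRISPR array on the same genome lies within max_gap bp.

    max_gap is an assumed cutoff, not a value taken from the paper.
    A genome with no CRISPR arrays at all also counts as isolated.
    """
    same_genome = [c for c in crisprs if c.genome == cas.genome]
    return all(gap(cas, c) > max_gap for c in same_genome)


def spacer_diversity(spacers: List[str]) -> float:
    """Fraction of distinct spacer sequences in an array (0.0 for an empty array)."""
    return len(set(spacers)) / len(spacers) if spacers else 0.0
```

Under these assumptions, a candidate array with a low `spacer_diversity` and no cas locus within `max_gap` would be a candidate false-CRISPR, while a cas locus for which `is_isolated` returns true would count toward the isolated fraction.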
This is the first systematic study to collect and characterize false-CRISPR elements. We demonstrated that a collection of false-CRISPRs can be used to reduce the false annotation of CRISPRs, and is therefore useful for improving the annotation of CRISPR-Cas systems.