Podlevsky Joshua D, Hudson Corey M, Timlin Jerilyn A, Williams Kelly P
Molecular and Microbiology, Sandia National Laboratories, Albuquerque, NM 87185, USA.
Computational Biology and Biophysics, Sandia National Laboratories, Albuquerque, NM 87185, USA.
NAR Genom Bioinform. 2020 Sep 3;2(3):lqaa063. doi: 10.1093/nargab/lqaa063. eCollection 2020 Sep.
CRISPR arrays and CRISPR-associated (Cas) proteins comprise a widespread adaptive immune system in bacteria and archaea. These systems defend against exogenous parasitic mobile genetic elements, including bacteriophages, plasmids and foreign nucleic acids. With the continuous spread of antibiotic resistance, knowledge of pathogen susceptibility to bacteriophage therapy is becoming more critical. Additionally, gene-editing applications would benefit from the discovery of new genes with favorable properties. While next-generation sequencing has produced staggering quantities of data, moving from raw sequencing reads to the identification of CRISPR/Cas systems has remained challenging. This is especially true for metagenomic data, which have the highest potential for identifying novel genes. We report a comprehensive computational pipeline, CasCollect, for the targeted assembly and annotation of genes and CRISPR arrays, including isolated arrays, from raw sequencing reads. Benchmarking shows that our targeted assembly pipeline runs almost two orders of magnitude faster than conventional assembly and annotation, while retaining the ability to detect CRISPR arrays and genes. CasCollect is a highly versatile pipeline that can be used for targeted assembly of any specialty gene set and can be reconfigured with user-provided Hidden Markov Models and/or reference nucleotide sequences.