Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more.
Affiliations
Human Genetics Laboratory, National Institute of Genetics, Research Organization of Information and Systems, Mishima, Shizuoka, Japan.
The Graduate University for Advanced Studies, SOKENDAI, Mishima, Shizuoka, Japan.
Publication information
PLoS Comput Biol. 2021 Oct 21;17(10):e1009428. doi: 10.1371/journal.pcbi.1009428. eCollection 2021 Oct.
Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
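Three ideas in the abstract can be illustrated with a short Python sketch: terminal redundancy as a signal of a complete circular genome, protospacers as matches to CRISPR spacers, and host prediction by linking those matches to taxonomically assigned repeats. This is not the authors' pipeline, which works from unassembled short reads and is not specified in the abstract. The function names, the 20-nucleotide minimum overlap, the exact-match criterion, and the spacer-to-phylum table are all assumptions made for illustration.

```python
# Illustrative sketch only; thresholds and data structures are assumptions,
# not values taken from the paper.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_redundancy(seq: str, min_overlap: int = 20) -> int:
    """Length of the longest prefix that is repeated as a suffix.

    Returns 0 when no repeat of at least min_overlap bases exists.
    Only overlaps up to half the sequence length are considered.
    """
    seq = seq.upper()
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def find_protospacers(contig: str, spacers: list[str]) -> dict[str, list[int]]:
    """Map each spacer to the 0-based contig positions it matches exactly.

    Both strands are searched; positions refer to the forward strand.
    """
    contig = contig.upper()
    hits: dict[str, list[int]] = {}
    for spacer in spacers:
        positions = []
        # A set avoids double counting when a spacer is its own reverse complement.
        for query in {spacer.upper(), reverse_complement(spacer)}:
            start = contig.find(query)
            while start != -1:
                positions.append(start)
                start = contig.find(query, start + 1)
        if positions:
            hits[spacer] = sorted(positions)
    return hits


def predict_host_phyla(hits: dict[str, list[int]],
                       spacer_phylum: dict[str, str]) -> list[str]:
    """Collect the phyla of all matching spacers (hypothetical lookup table)."""
    return sorted({spacer_phylum[s] for s in hits if s in spacer_phylum})


if __name__ == "__main__":
    repeat = "ATGCGTACGTTAGCCTAGGA"            # 20-nt terminal repeat (made up)
    body = "CAGTCAGTCAGTCAGCCGGTTCCGGTTCCG"    # made-up genome body
    contig = repeat + body + repeat
    print(terminal_redundancy(contig))
    print(find_protospacers("TTTTTGATTACATTTTT", ["GATTACA", "CCCCCCC"]))
```

A contig whose start is repeated at its end can be trimmed by the overlap length and treated as a candidate circular genome. A real analysis would likely tolerate mismatches in spacer matching and would need to exclude the CRISPR arrays themselves, neither of which this sketch attempts.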