EMBL-EBI, Cambridge CB10 1SD, UK.
Nuffield Department of Medicine, University of Oxford, Oxford OX3 9DU, UK.
Bioinformatics. 2022 Jun 13;38(12):3291-3293. doi: 10.1093/bioinformatics/btac311.
Viral sequence data from clinical samples frequently contain contaminating human reads, which must be removed prior to sharing for legal and ethical reasons. To enable host read removal for SARS-CoV-2 sequencing data on low-specification laptops, we developed ReadItAndKeep, a fast, lightweight tool for Illumina and nanopore data that keeps only reads matching the SARS-CoV-2 genome. Peak RAM usage is typically below 10 MB, and runtime is less than 1 min. We show that, by excluding the polyA tail from the viral reference, ReadItAndKeep prevents bleed-through of human reads, whereas mapping to the human genome lets some human reads escape removal. We believe our test approach (including all possible reads from the human genome, human samples from each of the 26 populations in the 1000 Genomes data and a diverse set of SARS-CoV-2 genomes) will also be useful for others.
ReadItAndKeep is implemented in C++, released under the MIT license, and available from https://github.com/GenomePathogenAnalysisService/read-it-and-keep.
Supplementary data are available at Bioinformatics online.
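The role of the polyA tail can be illustrated with a minimal sketch. It is not taken from the ReadItAndKeep source code: the function name `strip_poly_a_tail`, the `min_tail_length` threshold and the toy reference sequence are all hypothetical. It shows only the idea of trimming a trailing run of A bases from a viral reference before reads are matched against it, on the assumption that adenine-rich human reads could otherwise match that tail.

```python
def strip_poly_a_tail(sequence: str, min_tail_length: int = 10) -> str:
    """Return `sequence` without its trailing run of A bases.

    The run is removed only if it is at least `min_tail_length` bases long
    (a hypothetical threshold chosen for illustration); shorter runs are
    treated as ordinary sequence and left in place.
    """
    stripped = sequence.rstrip("Aa")
    tail_length = len(sequence) - len(stripped)
    return stripped if tail_length >= min_tail_length else sequence


if __name__ == "__main__":
    # Toy reference: a short made-up sequence followed by a 33-base polyA tail.
    toy_reference = "ACGTTGCC" + "A" * 33
    trimmed = strip_poly_a_tail(toy_reference)
    print(len(toy_reference), len(trimmed))
```

With the tail removed, a read consisting mostly of A bases has nothing in the reference to match, so it is not kept under a keep-only-matching-reads filter.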