Tae Hongseok, Karunasena Enusha, Bavarva Jasmin H, McIver Lauren J, Garner Harold R
Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA.
Genomics. 2014 Dec;104(6 Pt B):453-8. doi: 10.1016/j.ygeno.2014.08.009. Epub 2014 Aug 27.
Several studies have demonstrated that unmapped reads in next-generation sequencing data can be used to identify infectious agents or structural variants, but there has been no intensive effort to analyze and classify all non-human sequences found in individual large data sets. To identify commonalities among non-human sequences arising from infectious agents and putative contamination events, we analyzed non-human sequences in 150 genomic sequencing data files from the 1000 Genomes Project and observed that, on average, 0.13% of reads showed similarity to non-human genomes. We compared results among sample groups divided by ethnicity, sequencing center, and enrichment method (whole-genome sequencing vs. exome sequencing) and found that sequencing centers had specific signatures of contaminating genomes that act as 'time stamps'. We also observed many unmapped reads that falsely indicated contamination because of the high similarity of human sequences to sequences in non-human genome assemblies such as mouse and Nicotiana.
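As an illustration of the kind of summary statistic reported above, the following minimal sketch computes the percentage of reads per sample that show similarity to non-human genomes, then averages it over all samples and per sequencing center. It is not the authors' pipeline: the sample records, center names, read counts, and function names are all hypothetical, and averaging per-sample percentages is an assumed reading of "on average".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample tallies (illustrative values, not data from the study).
# Each record: (sample_id, sequencing_center, total_reads, non_human_reads)
SAMPLES = [
    ("S1", "CenterA", 1_000_000, 1_000),
    ("S2", "CenterA", 2_000_000, 3_000),
    ("S3", "CenterB", 1_000_000, 2_000),
]


def non_human_percent(total_reads, non_human_reads):
    """Percentage of a sample's reads that match non-human genomes."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * non_human_reads / total_reads


def mean_percent(samples):
    """Mean of the per-sample percentages (assumed meaning of 'on average')."""
    return mean(non_human_percent(total, hits) for _, _, total, hits in samples)


def mean_percent_by_center(samples):
    """Mean per-sample percentage, grouped by sequencing center."""
    groups = defaultdict(list)
    for _, center, total, hits in samples:
        groups[center].append(non_human_percent(total, hits))
    return {center: mean(values) for center, values in groups.items()}


if __name__ == "__main__":
    print(f"overall mean: {mean_percent(SAMPLES):.3f}%")
    for center, value in sorted(mean_percent_by_center(SAMPLES).items()):
        print(f"{center}: {value:.3f}%")
```

The same grouping function could be keyed on ethnicity or enrichment method instead of sequencing center by changing which field of the record is used as the group label.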