Large scale comparison of non-human sequences in human sequencing data.

Authors

Tae Hongseok, Karunasena Enusha, Bavarva Jasmin H, McIver Lauren J, Garner Harold R

Affiliation

Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA.

Publication information

Genomics. 2014 Dec;104(6 Pt B):453-8. doi: 10.1016/j.ygeno.2014.08.009. Epub 2014 Aug 27.

DOI: 10.1016/j.ygeno.2014.08.009

PMID: 25173571

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4262678/

Abstract

Several studies have demonstrated that unmapped reads in next generation sequencing data could be used to identify infectious agents or structural variants, but there has been no intensive effort to analyze and classify all non-human sequences found in individual large data sets. To identify commonality in non-human sequences by infectious agents and putative contamination events, we analyzed non-human sequences in 150 genomic sequencing data files from the 1000 Genomes Project and observed that 0.13% of reads on average showed similarities to non-human genomes. We compared results among different sample groups divided based on ethnicities, sequencing centers and enrichment methods (whole genome sequencing vs. exome sequencing) and found that sequencing centers had specific signatures of contaminating genomes as 'time stamps'. We also observed many unmapped reads that falsely indicated contamination because of the high similarity of human sequences to sequences in non-human genome assemblies such as mouse and Nicotiana.
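The headline figure in the abstract (0.13% of reads on average) and the comparison across sequencing centers both come down to per-sample proportions that are then averaged within groups. The following is a minimal sketch of that arithmetic only. It assumes that "on average" means the mean of per-sample percentages, which the abstract does not state. The sample records, center names, and function names are hypothetical and are not data or code from the study.

```python
from collections import defaultdict

# Hypothetical per-sample records:
# (sample_id, sequencing_center, total_reads, non_human_reads).
# These values are invented for illustration; they are not from the study.
SAMPLES = [
    ("S1", "CenterA", 1_000_000, 1_200),
    ("S2", "CenterA", 2_000_000, 2_800),
    ("S3", "CenterB", 1_500_000, 1_500),
    ("S4", "CenterB", 500_000, 900),
]


def non_human_percent(total_reads, non_human_reads):
    """Percentage of one sample's reads that resemble non-human genomes."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * non_human_reads / total_reads


def overall_mean_percent(samples):
    """Mean of the per-sample percentages (assumed meaning of 'on average')."""
    values = [non_human_percent(total, hits) for _, _, total, hits in samples]
    return sum(values) / len(values)


def mean_percent_by_center(samples):
    """Average the per-sample percentages within each sequencing center."""
    grouped = defaultdict(list)
    for _sample_id, center, total, hits in samples:
        grouped[center].append(non_human_percent(total, hits))
    return {center: sum(vals) / len(vals) for center, vals in grouped.items()}


if __name__ == "__main__":
    print(f"overall mean: {overall_mean_percent(SAMPLES):.3f}%")
    for center, pct in sorted(mean_percent_by_center(SAMPLES).items()):
        print(f"{center}: {pct:.3f}%")
```

The same grouping function could be keyed on ethnicity or enrichment method instead of sequencing center by changing which field is used as the group label.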

