The Genomic Scrapheap Challenge; Extracting Relevant Data from Unmapped Whole Genome Sequencing Reads, Including Strain Specific Genomic Segments, in Rats.

Authors

van der Weide Robin H, Simonis Marieke, Hermsen Roel, Toonen Pim, Cuppen Edwin, de Ligt Joep

Affiliations

Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands.

Division of Gene Regulation, The Netherlands Cancer Institute, Amsterdam, The Netherlands.

Publication information

PLoS One. 2016 Aug 8;11(8):e0160036. doi: 10.1371/journal.pone.0160036. eCollection 2016.

DOI: 10.1371/journal.pone.0160036

PMID: 27501045

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4976967/

Abstract

Unmapped next-generation sequencing reads are typically ignored even though they contain biologically relevant information. We systematically analyzed unmapped reads from whole genome sequencing of 33 inbred rat strains. High-quality reads were selected and enriched for biologically relevant sequences; similarity-based analysis revealed clustering similar to previously reported phylogenetic trees. Our results demonstrate that, on average, 20% of all unmapped reads harbor sequences that can be used to improve reference genomes and generate hypotheses on potential genotype-phenotype relationships. Analysis pipelines would benefit from incorporating the described methods, and reference genomes would benefit from inclusion of the genomic segments obtained through these efforts.
