Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing.

Author information

Liu Yu, Koyutürk Mehmet, Maxwell Sean, Xiang Min, Veigl Martina, Cooper Richard S, Tayo Bamidele O, Li Li, LaFramboise Thomas, Wang Zhenghe, Zhu Xiaofeng, Chance Mark R

Affiliations

Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA.

Publication information

BMC Genomics. 2014 Aug 16;15(1):685. doi: 10.1186/1471-2164-15-685.

DOI:10.1186/1471-2164-15-685

PMID:25129063

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4148959/

Abstract

BACKGROUND

Sequences up to several megabases in length have been found in individual genomes but are absent from the human reference genome. These sequences may be common in populations, and their absence from the reference genome may reflect rare variants in the genomes of the individuals who served as donors for the Human Genome Project. Because the reference genome is used for probe design in microarray technology and for mapping short reads in next generation sequencing (NGS), these missing sequences could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) reads and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent from the reference genome. However, no study has investigated the distribution, evolution and functionality of these sequences in human populations.
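As a rough illustration of how OEA and orphan reads can be pulled out of a paired-end alignment, the sketch below classifies a read by the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). It assumes the common usage in which an OEA pair has exactly one mapped end and an orphan pair has neither end mapped. The function name and the labels are hypothetical and are not taken from the authors' pipeline.

```python
# SAM flag bits as defined in the SAM format specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag: int) -> str:
    """Classify one read of a paired-end fragment from its SAM flag.

    The labels are illustrative names, not terms from the paper.
    """
    if not flag & PAIRED:
        return "unpaired"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "orphan"        # neither end of the pair maps
    if read_unmapped:
        return "oea_unmapped"  # unmapped end of an OEA pair
    if mate_unmapped:
        return "oea_anchor"    # mapped end that anchors its unmapped mate
    return "mapped_pair"


# Example flags: 99 (both ends mapped), 73 (mate unmapped),
# 133 (read unmapped, mate mapped), 77 and 141 (both unmapped).
example_flags = [99, 73, 133, 77, 141]
labels = [classify_read(f) for f in example_flags]
print(labels)
```

In a sketch like this, the "oea_unmapped" reads are the candidates for novel sequence, and the "oea_anchor" mates indicate roughly where that sequence sits relative to the reference.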

RESULTS

To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering to remove false sequences. We applied the pipeline to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population but absent from the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence by PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting that they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating with most micSeqs being present in Africans, which suggests that some micSeqs may be important sources of human diversity.
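The "present in at least 1% of the population" criterion can be pictured as a simple carrier-frequency filter over the pooled samples. The sketch below is a minimal illustration only: the input layout (a mapping from sequence ID to the sample IDs in which it was detected), the function name, and the toy data are all assumptions, and the abstract does not say how the authors estimated population frequency.

```python
def common_sequences(carriers, n_individuals, min_fraction=0.01):
    """Return IDs of sequences detected in at least min_fraction of individuals.

    carriers: dict mapping a sequence ID to an iterable of sample IDs in
    which that sequence was detected (hypothetical input layout).
    """
    kept = []
    for seq_id, sample_ids in carriers.items():
        # Count each individual once, even if it is listed repeatedly.
        fraction = len(set(sample_ids)) / n_individuals
        if fraction >= min_fraction:
            kept.append(seq_id)
    return sorted(kept)


# Toy data, not from the paper: 200 individuals, three candidate sequences.
toy = {
    "seqA": ["s1", "s2", "s3"],  # 3/200 = 1.5%, kept
    "seqB": ["s4"],              # 1/200 = 0.5%, dropped
    "seqC": ["s1", "s1"],        # duplicate sample, 1/200 = 0.5%, dropped
}
kept = common_sequences(toy, n_individuals=200)
print(kept)
```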

CONCLUSIONS

A comparative genomics approach confirmed 76% of the micSeqs. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific and conserved, and may play a role in the evolution of primates.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49a0/4148959/06870a2c776f/12864_2014_6377_Fig1_HTML.jpg
