Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis.

Affiliations

Department of Informatics, University of Oslo, Oslo, Norway.

K. G. Jebsen Coeliac Disease Research Centre, Oslo, Norway.

Publication information

BMC Bioinformatics. 2018 Dec 14;19(1):481. doi: 10.1186/s12859-018-2438-1.

DOI: 10.1186/s12859-018-2438-1
PMID: 30547739
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6293655/
Abstract

BACKGROUND

The current versions of reference genome assemblies still contain gaps, represented by stretches of Ns. Since high-throughput sequencing reads cannot be mapped to those gap regions, the regions are depleted of experimental data. Moreover, several technology platforms assay only a targeted portion of the genomic sequence, meaning that regions from the unassayed portion of the genomic sequence cannot be detected in those experiments. We refer here to all such regions as inaccessible regions, and hypothesize that ignoring these regions in the null model may increase false findings in statistical testing of colocalization of genomic features.
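To make the notion of an assembly gap concrete, here is a minimal sketch of how runs of Ns could be located in a sequence string. The function names `find_gaps` and `accessible_fraction` and the toy sequence are illustrative assumptions, not part of the paper or of any genome toolkit.

```python
import re

def find_gaps(sequence):
    """Return half-open (start, end) intervals for every run of N or n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]

def accessible_fraction(sequence):
    """Fraction of bases that do not fall inside a run of Ns."""
    gap_bases = sum(end - start for start, end in find_gaps(sequence))
    return 1 - gap_bases / len(sequence)

# Hypothetical toy sequence with two gaps (not real genomic data).
toy = "ACGTNNNNNACGTACGTNNAC"
gaps = find_gaps(toy)
fraction = accessible_fraction(toy)
print(gaps, fraction)
```

For a real assembly the gap coordinates would normally be read from an existing annotation rather than rescanned from the sequence; the scan above only illustrates what "stretches of Ns" means.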

RESULTS

Our exploratory analyses confirm that the genomic regions in public genomic tracks intersect very little with the assembly gaps of the human reference genomes (hg19 and hg38). What little intersection exists was observed only at the beginning and end portions of the gap regions. Further, we simulated a set of synthetic tracks by matching the properties of real genomic tracks in a way that nullified any true association between them. This allowed us to test our hypothesis that failing to avoid inaccessible regions (as represented by assembly gaps) in the null model would result in spurious inflation of statistical significance. We contrasted the distributions of test statistics and p-values of Monte Carlo-based permutation tests that either avoided or did not avoid assembly gaps in the null model when testing colocalization between a pair of tracks. The statistical tests that did not account for assembly gaps in the null model produced a distribution of the test statistic that is shifted to the right and a distribution of p-values that is shifted to the left (indicating inflated significance). We observed a similar level of inflated significance in hg19 and hg38, despite assembly gaps covering a smaller proportion of the latter reference genome.
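The contrast between the two null models can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' pipeline: the genome is reduced to 2000 single-position bins, the gap covers an exaggerated 20% of them, tracks are sets of point features, the test statistic is the count of shared bins, and all names (`overlap`, `permutation_test`, `gap_bins`) are hypothetical.

```python
import random

def overlap(track_a, track_b):
    """Test statistic: number of bins present in both tracks."""
    return len(track_a & track_b)

def permutation_test(track_a, track_b, allowed_bins, n_perm, rng):
    """Monte Carlo p-value; the null re-places track_b uniformly in allowed_bins."""
    observed = overlap(track_a, track_b)
    null = [
        overlap(track_a, set(rng.sample(allowed_bins, len(track_b))))
        for _ in range(n_perm)
    ]
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
    return observed, null, p_value

rng = random.Random(0)
genome_bins = list(range(2000))
gap_bins = set(range(800, 1200))  # assumed gap, exaggerated for illustration
accessible = [b for b in genome_bins if b not in gap_bins]

# Two independent tracks: no true association, but neither can land in the gap.
track_a = set(rng.sample(accessible, 300))
track_b = set(rng.sample(accessible, 300))

# Null model that avoids the gap versus one that ignores it.
obs, null_aware, p_aware = permutation_test(track_a, track_b, accessible, 500, rng)
_, null_naive, p_naive = permutation_test(track_a, track_b, genome_bins, 500, rng)

mean_aware = sum(null_aware) / len(null_aware)
mean_naive = sum(null_naive) / len(null_naive)
print(f"observed={obs} null mean (gap-aware)={mean_aware:.1f} p={p_aware:.3f}")
print(f"observed={obs} null mean (gap-naive)={mean_naive:.1f} p={p_naive:.3f}")
```

In this toy setup the gap-naive null scatters some features into the gap, where they can never overlap the other track, so the null overlaps come out lower and the same observed overlap looks more extreme. That is the mechanism behind the inflated significance described above, shown here with a deliberately oversized gap.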

CONCLUSION

We provide empirical evidence that inaccessible regions, even when covering only a few percent of the genome, can lead to a substantial number of false findings if not accounted for in statistical colocalization analysis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d238/6293655/fc12b4ada05f/12859_2018_2438_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d238/6293655/3d7db65d89f6/12859_2018_2438_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d238/6293655/3d7db65d89f6/12859_2018_2438_Fig2_HTML.jpg