Next-generation data filtering in the genomics era.

Affiliations

Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.

Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA.

Publication information

Nat Rev Genet. 2024 Nov;25(11):750-767. doi: 10.1038/s41576-024-00738-6. Epub 2024 Jun 14.

DOI: 10.1038/s41576-024-00738-6
PMID: 38877133

Abstract

Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering (removing sequencing bases, reads, genetic variants and/or individuals from a dataset) to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy-Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima's D value, population differentiation (F_ST), nucleotide diversity (π) and effective population size (N_e).
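To make two of the filters named in the abstract concrete, the following is a minimal sketch of per-locus filtering on missing data and minor allele frequency (MAF). It is not the authors' pipeline. The genotype encoding (0, 1 or 2 copies of the alternate allele for a diploid, biallelic locus, with `None` for a missing call), the function names `locus_stats` and `filter_loci`, and the default thresholds are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Assumed encoding: 0, 1, 2 = copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def locus_stats(genotypes: List[List[Genotype]], locus: int) -> Tuple[float, float]:
    """Return (missing_fraction, minor_allele_frequency) for one locus.

    `genotypes` is indexed as genotypes[individual][locus].
    """
    calls = [row[locus] for row in genotypes]
    observed = [g for g in calls if g is not None]
    missing_fraction = 1 - len(observed) / len(calls)
    if not observed:
        # No called genotypes: the allele frequency is undefined, so report 0.
        return missing_fraction, 0.0
    # Each diploid individual carries two allele copies.
    alt_freq = sum(observed) / (2 * len(observed))
    return missing_fraction, min(alt_freq, 1 - alt_freq)


def filter_loci(
    genotypes: List[List[Genotype]],
    min_maf: float = 0.05,      # illustrative threshold, not a recommendation
    max_missing: float = 0.2,   # illustrative threshold, not a recommendation
) -> List[int]:
    """Return indices of loci that pass both the missingness and MAF filters."""
    kept = []
    for locus in range(len(genotypes[0])):
        missing, maf = locus_stats(genotypes, locus)
        if missing <= max_missing and maf >= min_maf:
            kept.append(locus)
    return kept


# Toy dataset: 4 individuals x 3 loci.
toy = [
    [0, 0, None],
    [1, 0, None],
    [2, 0, 1],
    [1, 0, None],
]
print(filter_loci(toy))                                # default thresholds
print(filter_loci(toy, min_maf=0.0, max_missing=1.0))  # no filtering
```

In the toy data, the second locus is monomorphic and the third is missing in three of four individuals, so the set of retained loci changes with the thresholds. This is the sensitivity the abstract describes for downstream statistics.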

