A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset.

Authors

Zhou Yong, Kathiresan Nagarajan, Yu Zhichao, Rivera Luis F, Yang Yujian, Thimma Manjula, Manickam Keerthana, Chebotarov Dmytro, Mauleon Ramil, Chougule Kapeel, Wei Sharon, Gao Tingting, Green Carl D, Zuccolo Andrea, Xie Weibo, Ware Doreen, Zhang Jianwei, McNally Kenneth L, Wing Rod A

Affiliations

Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.

Arizona Genomics Institute (AGI), School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA.

Publication information

BMC Biol. 2024 Jan 25;22(1):13. doi: 10.1186/s12915-024-01820-5.

DOI:10.1186/s12915-024-01820-5
PMID:38273258
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10809545/
Abstract

BACKGROUND

Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable.

RESULTS

Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq).
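
The abstract does not say how HPC-GVCW distributes GATK work across compute resources. As a rough illustration of one common way to parallelize SNP calling, the sketch below splits each contig into fixed-size intervals and builds one `gatk HaplotypeCaller` command per interval, so that each command could be submitted as an independent job. It is a generic scatter step under stated assumptions, not the authors' implementation: the function names, file names, contig lengths, and chunk size are all hypothetical, and the commands are only constructed, never executed.

```python
from typing import Iterator, List, Tuple


def split_contig(contig: str, length: int, chunk_size: int) -> Iterator[str]:
    """Yield 1-based, inclusive intervals in GATK's contig:start-end form."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(1, length + 1, chunk_size):
        end = min(start + chunk_size - 1, length)
        yield f"{contig}:{start}-{end}"


def haplotypecaller_commands(
    reference: str,
    bam: str,
    contigs: List[Tuple[str, int]],
    chunk_size: int,
) -> List[List[str]]:
    """Build one HaplotypeCaller command per interval (hypothetical helper)."""
    commands = []
    for contig, length in contigs:
        for i, interval in enumerate(split_contig(contig, length, chunk_size)):
            # Output naming scheme is an assumption made for this sketch.
            out = f"{contig}.part{i:04d}.g.vcf.gz"
            commands.append([
                "gatk", "HaplotypeCaller",
                "-R", reference,   # reference FASTA
                "-I", bam,         # aligned reads for one sample
                "-L", interval,    # restrict calling to this interval
                "-ERC", "GVCF",    # emit a per-sample GVCF
                "-O", out,
            ])
    return commands


if __name__ == "__main__":
    # Hypothetical contig lengths; real values would come from the FASTA index.
    contigs = [("chr1", 2_500_000), ("chr2", 1_000_000)]
    for cmd in haplotypecaller_commands("ref.fasta", "sample.bam", contigs, 1_000_000):
        print(" ".join(cmd))
```

Each printed command is independent of the others, which is what allows such a step to spread over many nodes on a cluster or run sequentially on a desktop. The per-interval GVCFs would still have to be merged and jointly genotyped afterwards, which this sketch does not cover.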

CONCLUSIONS

This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1acc/10809545/9784e832a324/12915_2024_1820_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1acc/10809545/25ec6d2890f8/12915_2024_1820_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1acc/10809545/10dbec320ef4/12915_2024_1820_Fig2_HTML.jpg
