A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce.

Affiliation

Department of Computer Science, COMSATS University Islamabad, Attock Campus 43600, Pakistan.

Publication information

Genes (Basel). 2020 Feb 5;11(2):166. doi: 10.3390/genes11020166.

DOI: 10.3390/genes11020166
PMID: 32033366
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7074349/

Abstract

Next-generation sequencing (NGS) technologies produce huge amounts of biological data, which demand long processing times and large amounts of memory. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. We propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. In extensive experiments, the results are compared with the Bowtie and BWA aligners in the alignment phase and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. The experimental results show that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% higher F-score and a consistent accuracy of 99.80% on average. In comparison, a 0.21% higher mean accuracy is achieved. Moreover, SNP mining was performed to identify specific regions in genome sequences. All frameworks were run with their default memory-management configuration, and the observations show that all workflows have approximately the same memory requirement. In future work, we intend to display the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
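
The abstract reports F-score as its headline metric but does not spell out how it is computed. The sketch below uses the standard definition (the harmonic mean of precision and recall) and treats each SNP call as a hypothetical `(chromosome, position, alternate allele)` tuple compared against a truth set. The function name, the data layout, and the toy call sets are illustrative assumptions, not the authors' evaluation code.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation of one SNP call: (chromosome, 1-based position, alternate allele).
Site = Tuple[str, int, str]


class CallMetrics(NamedTuple):
    precision: float
    recall: float
    f_score: float


def evaluate_calls(called: Set[Site], truth: Set[Site]) -> CallMetrics:
    """Compare a set of called SNPs with a truth set using the standard F-score."""
    tp = len(called & truth)   # calls that are present in the truth set
    fp = len(called - truth)   # calls that are absent from the truth set
    fn = len(truth - called)   # true SNPs that the caller missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return CallMetrics(precision, recall, 0.0)
    f_score = 2 * precision * recall / (precision + recall)
    return CallMetrics(precision, recall, f_score)


# Toy data, invented purely for illustration.
truth = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"), ("chr2", 90, "C")}
called = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"), ("chr3", 7, "C")}

metrics = evaluate_calls(called, truth)
print(f"precision={metrics.precision:.2f} recall={metrics.recall:.2f} F={metrics.f_score:.2f}")
```

With three true positives, one false positive, and one false negative, precision, recall, and F-score all come out to 0.75. Set operations keep the comparison independent of call order, which would matter if calls were collected from many reducers.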
