DistMap: a toolkit for distributed short read mapping on a Hadoop cluster.

Affiliation

Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria.

Publication information

PLoS One. 2013 Aug 23;8(8):e72614. doi: 10.1371/journal.pone.0072614. eCollection 2013.

DOI:10.1371/journal.pone.0072614

PMID:24009693

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3751911/

Abstract

With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/
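The abstract does not say how DistMap divides its input among cluster nodes. As a rough illustration of the general idea behind distributed read mapping, the sketch below splits paired-end FASTQ input into fixed-size chunks that keep each mate pair together, so that each chunk could be handed to an independent mapper process. The function names (`parse_fastq`, `chunk_pairs`) and the chunk size are hypothetical and are not taken from DistMap.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, ...]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of lines."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return
        if (len(record) != 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield record


def chunk_pairs(
    reads1: Iterable[str],
    reads2: Iterable[str],
    pairs_per_chunk: int,
) -> Iterator[List[Tuple[FastqRecord, FastqRecord]]]:
    """Group mate pairs into chunks of at most pairs_per_chunk pairs.

    Assumption: both inputs list the mates in the same order and have
    the same number of records (zip stops at the shorter input).
    """
    pairs = zip(parse_fastq(reads1), parse_fastq(reads2))
    while True:
        chunk = list(islice(pairs, pairs_per_chunk))
        if not chunk:
            return
        yield chunk


# Toy input: three read pairs, split into chunks of two pairs each.
r1 = ["@read1/1", "ACGT", "+", "IIII",
      "@read2/1", "TTGA", "+", "IIII",
      "@read3/1", "GGCC", "+", "IIII"]
r2 = ["@read1/2", "TGCA", "+", "IIII",
      "@read2/2", "AGTT", "+", "IIII",
      "@read3/2", "CCGG", "+", "IIII"]

chunks = list(chunk_pairs(r1, r2, pairs_per_chunk=2))
for index, chunk in enumerate(chunks):
    print(index, [mate1[0] for mate1, _ in chunk])
```

In a real cluster setting each chunk would be written to distributed storage and mapped by a separate task, with the per-chunk SAM/BAM outputs merged afterwards; that part is omitted here.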

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6d75/3751911/f64ad095bf6d/pone.0072614.g001.jpg

Similar articles

HSRA: Hadoop-based spliced read aligner for RNA sequencing data.

PLoS One. 2018 Jul 31;13(7):e0201483. doi: 10.1371/journal.pone.0201483. eCollection 2018.

CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.

PLoS One. 2014 Jun 4;9(6):e98146. doi: 10.1371/journal.pone.0098146. eCollection 2014.

A hybrid and scalable error correction algorithm for indel and substitution errors of long reads.

BMC Genomics. 2019 Dec 20;20(Suppl 11):948. doi: 10.1186/s12864-019-6286-9.

Grape RNA-Seq analysis pipeline environment.

Bioinformatics. 2013 Mar 1;29(5):614-21. doi: 10.1093/bioinformatics/btt016. Epub 2013 Jan 17.

Subset selection of high-depth next generation sequencing reads for de novo genome assembly using MapReduce framework.

BMC Genomics. 2015;16 Suppl 12(Suppl 12):S9. doi: 10.1186/1471-2164-16-S12-S9. Epub 2015 Dec 9.

UNDR ROVER - a fast and accurate variant caller for targeted DNA sequencing.

BMC Bioinformatics. 2016 Apr 16;17:165. doi: 10.1186/s12859-016-1014-9.

Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.

Bioinformatics. 2012 Mar 15;28(6):876-7. doi: 10.1093/bioinformatics/bts054. Epub 2012 Feb 2.

ClinQC: a tool for quality control and cleaning of Sanger and NGS data in clinical research.

BMC Bioinformatics. 2016 Feb 2;17:56. doi: 10.1186/s12859-016-0915-y.

ngs_backbone: a pipeline for read cleaning, mapping and SNP calling using next generation sequence.

BMC Genomics. 2011 Jun 2;12:285. doi: 10.1186/1471-2164-12-285.

Cited by

Spatial transcriptomic applications in orthopedics.

Connect Tissue Res. 2025 Jul;66(4):227-238. doi: 10.1080/03008207.2025.2501703. Epub 2025 May 10.

Spatial and temporal gene expression patterns during early human odontogenesis process.

Front Bioeng Biotechnol. 2024 Jul 16;12:1437426. doi: 10.3389/fbioe.2024.1437426. eCollection 2024.

The genomic distribution of transposable elements is driven by spatially variable purifying selection.

Nucleic Acids Res. 2023 Sep 22;51(17):9203-9213. doi: 10.1093/nar/gkad635.

Natural variation in Drosophila shows weak pleiotropic effects.

Genome Biol. 2022 May 16;23(1):116. doi: 10.1186/s13059-022-02680-4.

Cloud Computing Enabled Big Multi-Omics Data Analytics.

Bioinform Biol Insights. 2021 Jul 28;15:11779322211035921. doi: 10.1177/11779322211035921. eCollection 2021.

GPrimer: a fast GPU-based pipeline for primer design for qPCR experiments.

BMC Bioinformatics. 2021 Apr 29;22(1):220. doi: 10.1186/s12859-021-04133-4.

Long-Term Dynamics Among Wolbachia Strains During Thermal Adaptation of Their Drosophila Hosts.

Front Genet. 2020 May 14;11:482. doi: 10.3389/fgene.2020.00482. eCollection 2020.

IMOS: improved Meta-aligner and Minimap2 On Spark.

BMC Bioinformatics. 2019 Jan 24;20(1):51. doi: 10.1186/s12859-018-2592-5.

Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons.

Gigascience. 2019 Feb 1;8(2):giy165. doi: 10.1093/gigascience/giy165.

A simple genetic basis of adaptation to a novel thermal environment results in complex metabolic rewiring in Drosophila.

Genome Biol. 2018 Aug 20;19(1):119. doi: 10.1186/s13059-018-1503-4.
