SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

Affiliations

Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, International Computer Science Institute, Berkeley, CA, USA, CRS4-Center for Advanced Studies, Research and Development in Sardinia, Italy and CSC-IT Center for Science, Finland.

Publication information

Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.

DOI: 10.1093/bioinformatics/btt601
PMID: 24149054
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3866557/
Abstract

SUMMARY

Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.
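
The map-and-reduce pattern that engines such as Apache Pig generate behind the scenes can be sketched in plain Python. This is only an illustration of the general idea, not SeqPig's API or Pig Latin: the function names (`map_chunk`, `reduce_counts`, `base_composition`), the chunk size, and the toy reads are all assumptions made for this example. Each chunk of reads is summarized independently (the step a cluster could run on separate nodes), and the partial summaries are then merged.

```python
from collections import Counter
from functools import reduce


def map_chunk(reads):
    """Map step: count bases in one chunk of reads.

    Each chunk is processed independently, so on a cluster
    different chunks could be handled by different nodes.
    """
    counts = Counter()
    for read in reads:
        counts.update(read.upper())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial base counts."""
    return left + right


def split_into_chunks(reads, chunk_size):
    """Split the read list into consecutive chunks of at most chunk_size reads."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def base_composition(reads, chunk_size=2):
    """Compute overall base counts by mapping over chunks and reducing the results."""
    chunks = split_into_chunks(reads, chunk_size)
    partials = map(map_chunk, chunks)
    return reduce(reduce_counts, partials, Counter())


# Toy input (hypothetical reads, not real sequencing data).
reads = ["ACGT", "AAGT", "ccgN", "TTTT", "GA"]
totals = base_composition(reads)
print(sorted(totals.items()))
# [('A', 4), ('C', 3), ('G', 4), ('N', 1), ('T', 6)]
```

Here the chunks are processed sequentially in one process; the point of a system like the one described above is that the user writes only the high-level query, and the engine takes care of distributing the map and reduce steps.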

AVAILABILITY AND IMPLEMENTATION

Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dbf6/3866557/0ff867099d91/btt601f3p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dbf6/3866557/e92221984c08/btt601f1p.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dbf6/3866557/12af5cbeb3f7/btt601f2p.jpg

Similar articles

1
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.
2
CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.
PLoS One. 2014 Jun 4;9(6):e98146. doi: 10.1371/journal.pone.0098146. eCollection 2014.
3
SeqWare Query Engine: storing and searching sequence data in the cloud.
BMC Bioinformatics. 2010 Dec 21;11 Suppl 12(Suppl 12):S2. doi: 10.1186/1471-2105-11-S12-S2.
4
Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends.
BioData Min. 2014 Oct 29;7:22. doi: 10.1186/1756-0381-7-22. eCollection 2014.
5
SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision.
Bioinformatics. 2014 Sep 15;30(18):2652-3. doi: 10.1093/bioinformatics/btu343. Epub 2014 May 19.
6
MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud.
Bioinformatics. 2017 Sep 1;33(17):2762-2764. doi: 10.1093/bioinformatics/btx307.
7
Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce.
Proceedings VLDB Endowment. 2013 Aug;6(11).
8
Demonstration of Hadoop-GIS: A Spatial Data Warehousing System Over MapReduce.
Proc ACM SIGSPATIAL Int Conf Adv Inf. 2013 Nov;2013:528-531. doi: 10.1145/2525314.2525320.
9
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.
BMC Bioinformatics. 2010 Dec 21;11 Suppl 12(Suppl 12):S1. doi: 10.1186/1471-2105-11-S12-S1.
10
GenoMetric Query Language: a novel approach to large-scale genomic data management.
Bioinformatics. 2015 Jun 15;31(12):1881-8. doi: 10.1093/bioinformatics/btv048. Epub 2015 Feb 3.

Cited by

1
BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.
Front Big Data. 2022 Jan 18;4:727216. doi: 10.3389/fdata.2021.727216. eCollection 2021.
2
Cloud Computing Enabled Big Multi-Omics Data Analytics.
Bioinform Biol Insights. 2021 Jul 28;15:11779322211035921. doi: 10.1177/11779322211035921. eCollection 2021.
3
Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.
PLoS One. 2021 Aug 3;16(8):e0255260. doi: 10.1371/journal.pone.0255260. eCollection 2021.
4
A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses.
Front Genet. 2021 Jul 13;12:699280. doi: 10.3389/fgene.2021.699280. eCollection 2021.
5
Parallel Algorithms for Inferring Gene Regulatory Networks: A Review.
Curr Genomics. 2018 Nov;19(7):603-614. doi: 10.2174/1389202919666180601081718.
6
Benchmarking distributed data warehouse solutions for storing genomic variant information.
Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax049.
7
HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.
Algorithms Mol Biol. 2017 Sep 29;12:25. doi: 10.1186/s13015-017-0116-x. eCollection 2017.
8
START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries.
BMC Genomics. 2017 Sep 22;18(1):749. doi: 10.1186/s12864-017-4071-1.
9
Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations.
BMC Bioinformatics. 2016 Jul 28;17:292. doi: 10.1186/s12859-016-1159-6.
10
Single-cell Transcriptome Study as Big Data.
Genomics Proteomics Bioinformatics. 2016 Feb;14(1):21-30. doi: 10.1016/j.gpb.2016.01.005. Epub 2016 Feb 11.

References

1
BioPig: a Hadoop-based analytic toolkit for large-scale sequence data.
Bioinformatics. 2013 Dec 1;29(23):3014-9. doi: 10.1093/bioinformatics/btt528. Epub 2013 Sep 10.
2
Biology: The big challenges of big data.
Nature. 2013 Jun 13;498(7453):255-60. doi: 10.1038/498255a.
3
Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds.
BMC Bioinformatics. 2012 Aug 13;13:200. doi: 10.1186/1471-2105-13-200.
4
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.
Bioinformatics. 2012 Mar 15;28(6):876-7. doi: 10.1093/bioinformatics/bts054. Epub 2012 Feb 2.
5
SAMQA: error classification and validation of high-throughput sequenced read data.
BMC Genomics. 2011 Aug 18;12:419. doi: 10.1186/1471-2164-12-419.
6
SEAL: a distributed short read mapping and duplicate removal tool.
Bioinformatics. 2011 Aug 1;27(15):2159-60. doi: 10.1093/bioinformatics/btr325. Epub 2011 Jun 22.
7
SeqWare Query Engine: storing and searching sequence data in the cloud.
BMC Bioinformatics. 2010 Dec 21;11 Suppl 12(Suppl 12):S2. doi: 10.1186/1471-2105-11-S12-S2.
8
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.
BMC Bioinformatics. 2010 Dec 21;11 Suppl 12(Suppl 12):S1. doi: 10.1186/1471-2105-11-S12-S1.
9
The case for cloud computing in genome informatics.
Genome Biol. 2010;11(5):207. doi: 10.1186/gb-2010-11-5-207. Epub 2010 May 5.
10
Searching for SNPs with cloud computing.
Genome Biol. 2009;10(11):R134. doi: 10.1186/gb-2009-10-11-r134. Epub 2009 Nov 20.