Fragment assignment in the cloud with eXpress-D.

Author information

Department of Computer Science, 387 Soda Hall, UC Berkeley, Berkeley, CA 94720, USA.

Publication information

BMC Bioinformatics. 2013 Dec 7;14:358. doi: 10.1186/1471-2105-14-358.

DOI: 10.1186/1471-2105-14-358

PMID: 24314033

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3881492/

Abstract

BACKGROUND

Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability.
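To make the batch EM idea concrete, here is a minimal sketch in Python. It is not the eXpress or eXpress-D model: it assumes a simplified mixture in which each fragment carries a conditional likelihood for every target it maps to, and the only parameters are relative target abundances. The function name `em_fragment_assignment`, the dictionary-based input format, and the toy data are all illustrative assumptions.

```python
from collections import defaultdict


def em_fragment_assignment(fragments, targets, n_iter=100):
    """Estimate relative target abundances with batch EM.

    fragments: list of dicts mapping target name -> P(fragment | target)
               (hypothetical input format, one dict per fragment).
    targets:   list of all target names.
    """
    # Start from a uniform abundance estimate.
    theta = {t: 1.0 / len(targets) for t in targets}
    for _ in range(n_iter):
        expected = defaultdict(float)
        # E-step: split each fragment across its compatible targets in
        # proportion to abundance times conditional likelihood.
        for alignments in fragments:
            weights = {t: theta[t] * p for t, p in alignments.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue  # fragment has no support under the current estimate
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: abundances are the normalized expected fragment counts.
        theta = {t: expected[t] / len(fragments) for t in targets}
    return theta


# Toy data: three fragments unique to A, one unique to B, two ambiguous.
toy_fragments = (
    [{"A": 1.0}] * 3 + [{"B": 1.0}] + [{"A": 1.0, "B": 1.0}] * 2
)
toy_theta = em_fragment_assignment(toy_fragments, ["A", "B"])
```

On this toy input the estimate converges to roughly 0.75 for A and 0.25 for B, so each ambiguous fragment is assigned about three quarters to A. Every iteration passes over all fragments, which is why the batch form becomes costly as datasets grow.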

RESULTS

We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters ("the cloud"). We demonstrate that our implementation easily scales to billions of sequenced fragments while providing the exact maximum likelihood assignment of ambiguous fragments. The method is shown to be more accurate than the most widely used tools available, and it runs in a constant amount of time when cluster resources are scaled linearly with the amount of input data.
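One way to see why EM distributes well is that the E-step only needs the current abundance estimate, and its output is a set of per-target sums. The sketch below illustrates that map-and-reduce structure in plain Python, with lists standing in for data partitions on a cluster. It does not use Spark and is not the eXpress-D implementation; the simplified abundance-only model and the names `local_expected_counts`, `merge_counts`, and `distributed_em` are assumptions made for illustration.

```python
from functools import reduce


def local_expected_counts(partition, theta):
    """Map step: run the E-step on one partition of fragments."""
    counts = {}
    for alignments in partition:
        weights = {t: theta[t] * p for t, p in alignments.items()}
        total = sum(weights.values())
        if total == 0.0:
            continue  # fragment has no support under the current estimate
        for t, w in weights.items():
            counts[t] = counts.get(t, 0.0) + w / total
    return counts


def merge_counts(a, b):
    """Reduce step: add two partial expected-count tables."""
    merged = dict(a)
    for t, c in b.items():
        merged[t] = merged.get(t, 0.0) + c
    return merged


def distributed_em(partitions, targets, n_iter=100):
    """EM where each iteration is a map over partitions plus a reduce."""
    n_fragments = sum(len(p) for p in partitions)
    theta = {t: 1.0 / len(targets) for t in targets}
    for _ in range(n_iter):
        # Each partition could be processed on a different worker; only the
        # small theta table goes out and a small counts table comes back.
        partials = [local_expected_counts(p, theta) for p in partitions]
        totals = reduce(merge_counts, partials, {})
        # M-step on the driver: normalize the summed expected counts.
        theta = {t: totals.get(t, 0.0) / n_fragments for t in targets}
    return theta


# Toy data split into two partitions (hypothetical input format).
toy_partitions = [
    [{"A": 1.0}, {"A": 1.0, "B": 1.0}],
    [{"A": 1.0}, {"A": 1.0}, {"B": 1.0}, {"A": 1.0, "B": 1.0}],
]
toy_theta = distributed_em(toy_partitions, ["A", "B"])
```

Because the partial counts are simply added, the result matches what a single-machine batch EM would compute on the same fragments, up to floating-point rounding. The work per iteration is spread across partitions, which is consistent with the abstract's claim of constant running time when cluster resources grow linearly with the input.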

CONCLUSIONS

The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.
