Nephele: genotyping via complete composition vectors and MapReduce.

Authors

Colosimo Marc E, Peterson Matthew W, Mardis Scott, Hirschman Lynette

Affiliation

The MITRE Corporation, 202 Burlington Rd, Bedford MA 01730, USA.

Publication

Source Code Biol Med. 2011 Aug 18;6:13. doi: 10.1186/1751-0473-6-13.

DOI: 10.1186/1751-0473-6-13
PMID: 21851626
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3182884/

Abstract

BACKGROUND

Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences.

RESULTS

Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.
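
The alignment-free idea described above can be illustrated with a minimal sketch. It is not Nephele's code: it uses plain k-mer relative frequencies and cosine distance as stand-ins for the complete composition vector and its distance measure, and the function names (`kmer_vector`, `cosine_distance`, `distance_matrix`) and the default `k=3` are hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k):
    """Return a sparse vector (dict) of relative frequencies of the
    overlapping k-mers in seq. Simplified stand-in for a composition vector."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(seqs, k=3):
    """Pairwise distances for a dict {name: sequence}; no alignment needed."""
    vecs = {name: kmer_vector(s, k) for name, s in seqs.items()}
    return {(a, b): cosine_distance(vecs[a], vecs[b])
            for a in vecs for b in vecs}


if __name__ == "__main__":
    toy = {
        "s1": "ACGTACGTACGTACGT",
        "s2": "ACGTACGTACGAACGT",
        "s3": "TTTTGGGGCCCCAAAA",
    }
    d = distance_matrix(toy, k=3)
    print(d[("s1", "s2")], d[("s1", "s3")])
```

A matrix like this is the kind of input a clustering step (affinity propagation in the abstract) or a neighbour-joining tree builder would consume. Because each sequence's vector is computed independently, the vector step is a natural fit for parallel execution across nodes.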

CONCLUSIONS

We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c82/3182884/80082bad6a73/1751-0473-6-13-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c82/3182884/509a409847da/1751-0473-6-13-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c82/3182884/36726c5a50f1/1751-0473-6-13-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c82/3182884/bfb857d141ba/1751-0473-6-13-4.jpg