Outlier detection in BLAST hits.

Authors

Shah Nidhi, Altschul Stephen F, Pop Mihai

Affiliations

1. Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, 20742 USA.

2. Computational Biology Branch, NCBI, NLM, NIH, Bethesda, 20894 USA.

Publication information

Algorithms Mol Biol. 2018 Mar 22;13:7. doi: 10.1186/s13015-018-0126-3. eCollection 2018.

DOI: 10.1186/s13015-018-0126-3
PMID: 29588650
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5863388/

Abstract

BACKGROUND

An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods, which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. Similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. In contrast, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive.
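
To make the "best BLAST hit" strategy described above concrete, here is a minimal sketch. It assumes BLAST was run with the default 12-column tabular output (`-outfmt 6`); the function names, the example rows, and the `TAXONOMY` lookup table are hypothetical and exist only for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text):
    """Return {query id: (subject id, bit-score)} for each query's top-scoring hit."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (row["sseqid"], score)
    return best


def assign_labels(hits, taxonomy):
    """Label each query with the taxon of its best hit (None if the subject is unknown)."""
    return {query: taxonomy.get(subject) for query, (subject, _) in hits.items()}


# Made-up rows and a made-up subject-to-taxon table (assumptions, not real data).
EXAMPLE = "\n".join([
    "read1\tref_A\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450",
    "read1\tref_B\t97.0\t250\t7\t0\t1\t250\t1\t250\t1e-110\t420",
    "read2\tref_C\t90.0\t240\t24\t0\t1\t240\t5\t244\t1e-80\t300",
])
TAXONOMY = {"ref_A": "Escherichia", "ref_B": "Shigella", "ref_C": "Bacillus"}

labels = assign_labels(best_hits(EXAMPLE), TAXONOMY)
print(labels)
```

In this toy input, `read1` receives the label of `ref_A` even though `ref_B` scores almost as well, which is the kind of situation where the top hit alone may not reflect the best taxonomic match.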

RESULTS

We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step. In this work, we explore whether and when using top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log-Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and assign taxonomic labels. We compared the accuracy of our method to the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluated the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets.
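
The two-step idea can be sketched roughly as follows. This is not the authors' method: the paper defines outliers with modified BILD scores over a multiple alignment, whereas the sketch below swaps in a much simpler stand-in (cutting the ranked hit list at the largest bit-score drop). The `min_gap` threshold, the rule of deferring a query when no clear cut exists, the `fallback` hook standing in for a phylogenetic tool, and all example data are assumptions made for illustration.

```python
def split_at_largest_gap(hits):
    """Split (subject, score) hits into (kept, outliers) at the largest score drop.

    Stand-in heuristic only; the paper uses modified BILD scores instead.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if len(ranked) < 2:
        return ranked, []
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ranked[:cut], ranked[cut:]


def common_prefix(lineages):
    """Deepest lineage prefix shared by all lineages (a simple lowest common ancestor)."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


def classify(hits, lineage_of, min_gap=10.0, fallback=None):
    """Step 1: label from the top cluster. Step 2: defer when no clear cut exists."""
    kept, outliers = split_at_largest_gap(hits)
    clear = bool(outliers) and kept[-1][1] - outliers[0][1] >= min_gap
    if clear:
        return common_prefix([lineage_of[subject] for subject, _ in kept])
    # Hypothetical hook: hand the query to a slower phylogenetic method.
    return fallback(hits) if fallback else None


# Made-up lineages and hits (assumptions, not real data).
LINEAGE_OF = {
    "ref_A": ("Bacteria", "Proteobacteria", "Escherichia"),
    "ref_B": ("Bacteria", "Proteobacteria", "Shigella"),
    "ref_C": ("Bacteria", "Firmicutes", "Bacillus"),
    "ref_D": ("Bacteria", "Firmicutes", "Listeria"),
}
CLEAR_HITS = [("ref_A", 450.0), ("ref_B", 445.0), ("ref_C", 300.0), ("ref_D", 295.0)]
FLAT_HITS = [("ref_A", 450.0), ("ref_B", 448.0), ("ref_C", 446.0)]

print(classify(CLEAR_HITS, LINEAGE_OF))
print(classify(FLAT_HITS, LINEAGE_OF, fallback=lambda h: "deferred"))
```

With these toy inputs, the first query is labeled at the deepest rank its two retained hits agree on, while the second, whose hits are nearly indistinguishable by score, is passed to the fallback rather than forced into a label.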

CONCLUSION

Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before using phylogenetic methods and provides a way to interpret BLAST results using more information than provided by E-values and bit-scores alone.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/2b1fa46735eb/13015_2018_126_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/37cd1fc0bd66/13015_2018_126_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/cd4b6515e5b3/13015_2018_126_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/35839602b040/13015_2018_126_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/764ee546d7b0/13015_2018_126_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/9bcf710ae588/13015_2018_126_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/dede340f27fb/13015_2018_126_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/c9e84f546ad8/13015_2018_126_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/85a56ace5e39/13015_2018_126_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2cc8/5863388/0a76e63acdbc/13015_2018_126_Fig10_HTML.jpg
