A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads.

Authors

Le Vinh Van, Tran Lang Van, Tran Hoai Van

Affiliations

Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, HCM City, Vietnam.

Faculty of Information Technology, HCMC University of Technology and Education, 1 Vo Van Ngan, Thu Duc, HCM City, Vietnam.

Publication information

BMC Bioinformatics. 2016 Jan 6;17:22. doi: 10.1186/s12859-015-0872-x.

DOI:10.1186/s12859-015-0872-x
PMID:26740458
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4702387/
Abstract

BACKGROUND

Taxonomic assignment is a crucial step in a metagenomic project which aims to identify the origin of sequences in an environmental sample. Among the existing methods, since composition-based algorithms are not sufficient for classifying short reads, recent algorithms use only the feature of similarity, or similarity-based combined features. However, those algorithms suffer from the computational expense because the task of similarity search is very time-consuming. Besides, the lack of similarity information between reads and reference sequences due to the length of short reads reduces significantly the classification quality.

RESULTS

This paper presents a novel taxonomic assignment algorithm, called SeMeta, which is based on semi-supervised learning to produce a fast and highly accurate classification of short-length reads with sufficient mutual overlap. The proposed algorithm firstly separates reads into clusters using their composition feature. It then labels the clusters with the support of an efficient filtering technique on results of the similarity search between their reads and reference databases. Furthermore, instead of performing the similarity search for all reads in the clusters, SeMeta only does for reads in their subgroups by utilizing the information of sequence overlapping. The experimental results demonstrate that SeMeta outperforms two other similarity-based algorithms on different aspects.

CONCLUSIONS

By using a semi-supervised method as well as taking the advantages of various features, the proposed algorithm is able not only to achieve high classification quality, but also to reduce much computational cost. The source codes of the algorithm can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMeta.html.
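
The Results paragraph describes a three-part flow: group reads by composition, run the similarity search on only some reads in each group, then label each group from the filtered hits. The sketch below illustrates that general idea in Python and is not the SeMeta implementation. Everything in it is an assumption made for illustration: the function names `kmer_profile`, `cluster_reads`, and `label_clusters`, the use of dinucleotide frequencies with a small k-means loop, the `min_support` majority-vote threshold standing in for the paper's filtering technique, and the `hits` dictionary standing in for similarity-search results on a subset of reads.

```python
from collections import Counter
from itertools import product


def kmer_profile(read, k=2):
    """Normalized k-mer frequency vector over the ACGT alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def _dist2(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_reads(reads, n_clusters=2, k=2, iters=10):
    """Toy k-means on composition profiles (hypothetical stand-in for
    the composition-based clustering step)."""
    profiles = [kmer_profile(r, k) for r in reads]
    # Deterministic farthest-point initialization.
    centers = [profiles[0]]
    while len(centers) < n_clusters:
        centers.append(
            max(profiles, key=lambda p: min(_dist2(p, c) for c in centers))
        )
    assign = [0] * len(reads)
    for _ in range(iters):
        assign = [
            min(range(n_clusters), key=lambda j: _dist2(p, centers[j]))
            for p in profiles
        ]
        for j in range(n_clusters):
            members = [p for p, a in zip(profiles, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def label_clusters(assign, hits, min_support=0.6):
    """Label every read with its cluster's majority taxon.

    `hits` maps read index -> taxon for the subset of reads that were
    actually searched. A cluster is labeled only if its top taxon holds
    at least `min_support` of that cluster's hits (an assumed filter).
    """
    votes = {}
    for idx, taxon in hits.items():
        votes.setdefault(assign[idx], Counter())[taxon] += 1
    labels = {}
    for c in set(assign):
        counter = votes.get(c)
        labels[c] = "unassigned"
        if counter:
            taxon, n = counter.most_common(1)[0]
            if n / sum(counter.values()) >= min_support:
                labels[c] = taxon
    return [labels[c] for c in assign]


reads = [
    "ATATATTAATAT", "TTATATATAATA", "AATATTATATAT",  # AT-rich
    "GCGCGGCCGCGC", "CGCGCCGGCGCG", "GGCGCGCCGCGG",  # GC-rich
]
assign = cluster_reads(reads, n_clusters=2)
# Only reads 0 and 3 were "searched"; their labels spread to cluster mates.
print(label_clusters(assign, {0: "taxon_A", 3: "taxon_B"}))
```

Only two of the six reads carry a hit here, yet all six receive a label through their cluster. That is the cost saving the abstract describes: fewer similarity searches, with cluster membership carrying the label to the remaining reads.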

Similar articles

1
A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads.
BMC Bioinformatics. 2016 Jan 6;17:22. doi: 10.1186/s12859-015-0872-x.
2
A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads.
Algorithms Mol Biol. 2015 Jan 16;10(1):2. doi: 10.1186/s13015-014-0030-4. eCollection 2015.
3
DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences.
BMC Bioinformatics. 2010 Oct 15;11 Suppl 7(Suppl 7):S14. doi: 10.1186/1471-2105-11-S7-S14.
4
A statistical framework for accurate taxonomic assignment of metagenomic sequencing reads.
PLoS One. 2012;7(10):e46450. doi: 10.1371/journal.pone.0046450. Epub 2012 Oct 1.
5
MTR: taxonomic annotation of short metagenomic reads using clustering at multiple taxonomic ranks.
Bioinformatics. 2011 Jan 15;27(2):196-203. doi: 10.1093/bioinformatics/btq649. Epub 2010 Dec 1.
6
Mora: abundance aware metagenomic read re-assignment for disentangling similar strains.
BMC Bioinformatics. 2024 Apr 23;25(1):161. doi: 10.1186/s12859-024-05768-9.
7
INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences.
BMC Genomics. 2011 Nov 30;12 Suppl 3(Suppl 3):S4. doi: 10.1186/1471-2164-12-S3-S4.
8
Classifying short genomic fragments from novel lineages using composition and homology.
BMC Bioinformatics. 2011 Aug 9;12:328. doi: 10.1186/1471-2105-12-328.
9
MBMC: An Effective Markov Chain Approach for Binning Metagenomic Reads from Environmental Shotgun Sequencing Projects.
OMICS. 2016 Aug;20(8):470-9. doi: 10.1089/omi.2016.0081. Epub 2016 Jul 22.
10
Large-scale metagenomic sequence clustering on map-reduce clusters.
J Bioinform Comput Biol. 2013 Feb;11(1):1340001. doi: 10.1142/S0219720013400015. Epub 2012 Dec 25.

Cited by

1
Active semi-supervised learning for biological data classification.
PLoS One. 2020 Aug 19;15(8):e0237428. doi: 10.1371/journal.pone.0237428. eCollection 2020.
2
High-resolution characterization of the human microbiome.
Transl Res. 2017 Jan;179:7-23. doi: 10.1016/j.trsl.2016.07.012. Epub 2016 Jul 25.
