An efficient classification algorithm for NGS data based on text similarity.

Authors

Liao Xiangyu, Liao Xingyu, Zhu Wufei, Fang Lu, Chen Xing

Affiliations

Department of Oncology, The First College of Clinical Medical Science, China Three Gorges University, Yichang Central People's Hospital, Yichang, Hubei 443000, P.R. China.

School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China.

Publication information

Genet Res (Camb). 2018 Sep 17;100:e8. doi: 10.1017/S0016672318000058.

DOI: 10.1017/S0016672318000058
PMID: 30221607
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6865153/
Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time. Clustering can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data, based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers, then forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; this merge operation is executed iteratively until no text satisfies the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC clustered 100 million short reads within 2 hours and performed very well in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced, and the assembler can achieve better assemblies in terms of N50, NA50 and genome fraction.
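
The five steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' HSC implementation: the abstract does not say which text similarity measure or threshold HSC uses, so the Jaccard index and the default `threshold=0.5` are assumptions made here for illustration, and all function names (`canonical`, `build_kmer_table`, `cluster_reads`) are hypothetical. The naive pairwise merge loop is also far slower than anything that could cluster 100 million reads.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Merge a k-mer with its reverse complement by keeping the smaller string."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: key = unique canonical k-mer, value = IDs of reads containing it."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: shared read IDs over all read IDs."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=3, threshold=0.5):
    """Steps 3-5: treat each hash entry as a text and merge similar texts."""
    # Step 3: each hash unit becomes a short "text" (a set of read IDs).
    texts = [set(ids) for ids in build_kmer_table(reads, k).values()]
    # Step 4: merge pairs that meet the threshold until none remain.
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    texts[i] |= texts[j]
                    del texts[j]
                    merged = True
                    break
            if merged:
                break
    # Step 5: each remaining long text is reported as a cluster of read IDs.
    return sorted(sorted(t) for t in texts)


if __name__ == "__main__":
    # Read 1 is the reverse complement of read 0; read 2 shares no k-mers with them.
    print(cluster_reads(["AAACC", "GGTTT", "TCTCT"], k=3))
```

In this toy input the first two reads fall into one cluster because their canonical k-mers are identical, while the third read forms its own cluster. Note that with this simplified merge rule a read may appear in more than one cluster; whether HSC allows that is not stated in the abstract.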
