3GOLD: optimized Levenshtein distance for clustering third-generation sequencing data.

Affiliations

College of Science, Department of Biology, Northeastern University, 330 Huntington Ave, Boston, MA, 02115, USA.

Department of Biology, Eastern Nazarene College, 23 E Elm Ave, Quincy, MA, 02170, USA.

Publication information

BMC Bioinformatics. 2022 Mar 20;23(1):95. doi: 10.1186/s12859-022-04637-7.

DOI:10.1186/s12859-022-04637-7

PMID:35307007

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8934446/

Abstract

BACKGROUND

Third-generation sequencing offers some advantages over its next-generation sequencing predecessors, but with the caveat of a much higher error rate. Clustering related sequences is an essential task in modern biology. To accurately cluster error-rich sequences, error type and frequency need to be accounted for. Levenshtein distance is a well-established mathematical algorithm for measuring the edit distance between words, and it can weight insertions, deletions and substitutions separately. However, there are drawbacks to using Levenshtein distance in a biological context, and hence it has rarely been used for this purpose. We present novel modifications to the Levenshtein distance algorithm to optimize it for clustering error-rich biological sequencing data.
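To make the idea of separately weighted edits concrete, the following is a minimal sketch of the classic dynamic-programming Levenshtein distance with one cost per edit type. It is not the 3GOLD algorithm: the function name `weighted_levenshtein` and the parameters `ins_cost`, `del_cost` and `sub_cost` are hypothetical names chosen for illustration, and the cost values are assumptions rather than weights taken from the paper.

```python
def weighted_levenshtein(
    a: str,
    b: str,
    ins_cost: float = 1.0,  # assumed weight for an insertion
    del_cost: float = 1.0,  # assumed weight for a deletion
    sub_cost: float = 1.0,  # assumed weight for a substitution
) -> float:
    """Minimum total cost of the edits that turn `a` into `b`."""
    # prev[j] holds the cost of turning the processed prefix of `a` into b[:j].
    # With an empty prefix of `a`, the only option is j insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                              # delete ca
                curr[j - 1] + ins_cost,                          # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # A read with one missing base, scored with a cheaper deletion weight.
    print(weighted_levenshtein("ACGTTA", "ACTTA", del_cost=0.5))
```

Lowering the weight of an error type that a sequencing platform produces often makes reads that differ only by that error type look closer together, which is the general motivation for weighting described above.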

RESULTS

We successfully introduced a bidirectional frameshift allowance with end-user-determined accommodation caps, combined with weighted error discrimination. Furthermore, our modifications dramatically improved the computational speed of Levenshtein distance. For simulated ONT MinION and PacBio Sequel datasets, the average clustering sensitivity for 3GOLD was 41.45% (S.D. 10.39) higher than Sequence-Levenshtein distance, 52.14% (S.D. 9.43) higher than Levenshtein distance, 55.93% (S.D. 8.67) higher than Starcode, 42.68% (S.D. 8.09) higher than CD-HIT-EST and 61.49% (S.D. 7.81) higher than DNACLUST. For biological ONT MinION data, 3GOLD clustering sensitivity was 27.99% higher than Sequence-Levenshtein distance, 52.76% higher than Levenshtein distance, 56.39% higher than Starcode, 48% higher than CD-HIT-EST and 70.4% higher than DNACLUST.

CONCLUSION

Our modifications to Levenshtein distance have improved its speed and accuracy compared to the classic Levenshtein distance, Sequence-Levenshtein distance and other commonly used clustering approaches on simulated and biological third-generation sequenced datasets. Our clustering approach is appropriate for datasets with unknown cluster centroids, such as those generated with unique molecular identifiers, as well as datasets with known centroids, such as barcoded datasets. A strength of our approach is its high accuracy in resolving small clusters and reducing the number of singletons.
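For the known-centroid case (for example, a fixed barcode list), clustering reduces to assigning each read to its nearest centroid within a distance cap. The sketch below illustrates that general idea with the unweighted, classic Levenshtein distance; it is not the 3GOLD procedure. The names `levenshtein`, `assign_to_centroid` and `max_dist`, the example barcodes, and the rule of rejecting ties are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic unit-cost edit distance between `a` and `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def assign_to_centroid(
    read: str, centroids: Sequence[str], max_dist: int
) -> Optional[str]:
    """Return the unique nearest centroid within `max_dist`, else None."""
    scored = sorted((levenshtein(read, c), c) for c in centroids)
    if not scored:
        return None
    best_dist, best = scored[0]
    if best_dist > max_dist:
        return None  # too far from every centroid: left unassigned
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # ambiguous: two centroids are equally close
    return best


if __name__ == "__main__":
    barcodes = ["ACGTAC", "TTGGCA", "GACTGA"]  # hypothetical barcodes
    print(assign_to_centroid("ACGTTAC", barcodes, max_dist=2))
```

Reads that return `None` here would remain singletons. When centroids are unknown, as with unique molecular identifiers, they would first have to be inferred from the reads themselves, which this sketch does not attempt.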

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de71/8934446/a18310648cc8/12859_2022_4637_Fig1_HTML.jpg
