College of Science, Department of Biology, Northeastern University, 330 Huntington Ave, Boston, MA, 02115, USA.
Department of Biology, Eastern Nazarene College, 23 E Elm Ave, Quincy, MA, 02170, USA.
BMC Bioinformatics. 2022 Mar 20;23(1):95. doi: 10.1186/s12859-022-04637-7.
Third-generation sequencing offers some advantages over its next-generation sequencing predecessors, but with the caveat of a much higher error rate. Clustering related sequences is an essential task in modern biology. To accurately cluster error-rich sequences, error type and frequency need to be accounted for. Levenshtein distance is a well-established measure of the edit distance between words, and it can weight insertions, deletions and substitutions separately. However, there are drawbacks to using Levenshtein distance in a biological context, and hence it has rarely been used for this purpose. We present novel modifications to the Levenshtein distance algorithm to optimize it for clustering error-rich biological sequencing data.
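As a point of reference for the weighting described above, the following is a minimal sketch of the classic Levenshtein distance with separate costs for insertions, deletions and substitutions. It does not include the modifications presented in this work; the function name `weighted_levenshtein` and the parameters `ins_cost`, `del_cost` and `sub_cost` are illustrative assumptions.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Classic dynamic-programming edit distance with per-operation costs.

    Illustrative sketch only: this is the textbook algorithm, not the
    modified distance described in the abstract. The cost parameters
    are hypothetical names chosen for this example.
    """
    # prev[j] = cost of turning the empty prefix of `a` into b[:j]
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else sub_cost)
            delete = prev[j] + del_cost        # drop ca from a
            insert = curr[j - 1] + ins_cost    # add cb to a
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# Example: a single deleted base, with deletions weighted at half cost.
print(weighted_levenshtein("ACGT", "AGT", del_cost=0.5))
```

Lowering the cost of the error type that dominates a given sequencing platform makes reads that differ only by such errors appear closer together, which is the motivation for weighting the operations separately.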
We successfully introduced a bidirectional frameshift allowance with end-user-determined accommodation caps, combined with weighted error discrimination. Furthermore, our modifications dramatically improved the computational speed of Levenshtein distance. For simulated ONT MinION and PacBio Sequel datasets, the average clustering sensitivity for 3GOLD was 41.45% (S.D. 10.39) higher than Sequence-Levenshtein distance, 52.14% (S.D. 9.43) higher than Levenshtein distance, 55.93% (S.D. 8.67) higher than Starcode, 42.68% (S.D. 8.09) higher than CD-HIT-EST and 61.49% (S.D. 7.81) higher than DNACLUST. For biological ONT MinION data, 3GOLD clustering sensitivity was 27.99% higher than Sequence-Levenshtein distance, 52.76% higher than Levenshtein distance, 56.39% higher than Starcode, 48% higher than CD-HIT-EST and 70.4% higher than DNACLUST.
Our modifications to Levenshtein distance have improved its speed and accuracy relative to the classic Levenshtein distance, Sequence-Levenshtein distance and other commonly used clustering approaches on simulated and biological third-generation sequencing datasets. Our clustering approach is appropriate for datasets with unknown cluster centroids, such as those generated with unique molecular identifiers, as well as for datasets with known centroids, such as barcoded datasets. A strength of our approach is high accuracy in resolving small clusters and reducing the number of singletons.
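The difference between the two settings mentioned above, known centroids (barcodes) and unknown centroids (for example unique molecular identifiers), can be illustrated with a generic sketch. This is not the clustering procedure of this work: it uses the unmodified Levenshtein distance, a simple greedy strategy, and hypothetical names (`assign_to_known`, `greedy_cluster`, `max_dist`) chosen only for the example.

```python
def levenshtein(a, b):
    """Unweighted textbook edit distance (stand-in for any sequence distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def assign_to_known(reads, centroids, max_dist):
    """Known centroids: put each read in its nearest centroid's cluster,
    or leave it unassigned if no centroid is within max_dist."""
    clusters = {c: [] for c in centroids}
    unassigned = []
    for read in reads:
        best = min(centroids, key=lambda c: levenshtein(read, c))
        if levenshtein(read, best) <= max_dist:
            clusters[best].append(read)
        else:
            unassigned.append(read)
    return clusters, unassigned


def greedy_cluster(reads, max_dist):
    """Unknown centroids: the first read of each cluster acts as its centroid;
    a read that matches no existing centroid starts a new cluster."""
    clusters = []  # list of (centroid, members) pairs
    for read in reads:
        for centroid, members in clusters:
            if levenshtein(read, centroid) <= max_dist:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "CCCCCCCC"]
print(greedy_cluster(reads, max_dist=1))
print(assign_to_known(reads, ["ACGTACGT", "TTTTGGGG"], max_dist=1))
```

In the greedy case, a read that matches no existing centroid becomes a singleton cluster, which is why the choice of distance function and threshold directly affects the number of singletons reported.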