Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.

Authors

Baharav Tavor Z, Kamath Govinda M, Tse David N, Shomorony Ilan

Affiliations

Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.

Microsoft Research New England, Cambridge, MA 02142, USA.

Publication information

Patterns (N Y). 2020 Jul 31;1(6):100081. doi: 10.1016/j.patter.2020.100081. eCollection 2020 Sep 11.

DOI:10.1016/j.patter.2020.100081
PMID:33205128
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7660437/
Abstract

Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
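
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names, the BLAKE2b-based hash family, and the toy reads are assumptions made for illustration. The code estimates k-mer Jaccard similarity from min-hash collisions, stacks the collision indicators for one reference read into a matrix, and extracts that matrix's leading left singular vector. Power iteration stands in for a full singular value decomposition so that the example needs only the standard library. The abstract does not say how the final Spectral Jaccard score is normalised, so the sketch stops at the singular vector.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exact_jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hash_kmer(kmer, seed):
    """64-bit hash of a k-mer; each seed selects a different hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(kmer_set, num_hashes):
    """Minimum hash value of the set under each of num_hashes hash functions."""
    return [min(hash_kmer(x, s) for x in kmer_set) for s in range(num_hashes)]


def collision_row(sketch_ref, sketch_other):
    """1 where the two reads share the same min-hash value, else 0."""
    return [1 if a == b else 0 for a, b in zip(sketch_ref, sketch_other)]


def minhash_jaccard(sketch_a, sketch_b):
    """Fraction of hash functions that collide; estimates Jaccard similarity."""
    row = collision_row(sketch_a, sketch_b)
    return sum(row) / len(row)


def leading_left_singular_vector(matrix, iterations=100):
    """Power iteration for the leading left singular vector (illustrative only)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    u = [0.0] * n_rows
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            return u  # all-zero matrix: no collisions at all
        u = [x / norm_u for x in u]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
    return u


if __name__ == "__main__":
    k, num_hashes = 4, 200
    reference = "ACGTTGCATGCCGATAGCTAGGCTA"          # hypothetical reads
    others = ["ACGTTGCATGCCGATAGCTAGGCAT", "TTTTGGGGCCCCAAAATTTTGGGGC"]
    ref_sketch = minhash_sketch(kmers(reference, k), num_hashes)
    matrix = []
    for read in others:
        sketch = minhash_sketch(kmers(read, k), num_hashes)
        matrix.append(collision_row(ref_sketch, sketch))
        print("exact:", exact_jaccard(kmers(reference, k), kmers(read, k)),
              "min-hash estimate:", minhash_jaccard(ref_sketch, sketch))
    print("leading left singular vector:", leading_left_singular_vector(matrix))
```

Each row of `matrix` corresponds to one read compared against the reference, and each column to one hash function. The row mean is the ordinary min-hash Jaccard estimate, while the singular vector weights the columns unevenly, which is the general mechanism the abstract attributes to the spectral approach.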

Similar articles

1
Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.
Patterns (N Y). 2020 Jul 31;1(6):100081. doi: 10.1016/j.patter.2020.100081. eCollection 2020 Sep 11.
2
LexicHash: sequence similarity estimation via lexicographic comparison of hashes.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad652.
3
Locality-sensitive hashing for the edit distance.
Bioinformatics. 2019 Jul 15;35(14):i127-i135. doi: 10.1093/bioinformatics/btz354.
4
Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing.
J Comput Biol. 2020 Feb;27(2):223-233. doi: 10.1089/cmb.2019.0298. Epub 2019 Dec 4.
5
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.
bioRxiv. 2023 May 18:2023.05.16.540882. doi: 10.1101/2023.05.16.540882.
6
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.
Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad512.
7
Semi-supervised hashing for large-scale search.
IEEE Trans Pattern Anal Mach Intell. 2012 Dec;34(12):2393-406. doi: 10.1109/TPAMI.2012.48.
8
Fast computation of the eigensystem of genomic similarity matrices.
BMC Bioinformatics. 2024 Jan 25;25(1):43. doi: 10.1186/s12859-024-05650-8.
9
LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes.
PeerJ. 2021 Mar 24;9:e10906. doi: 10.7717/peerj.10906. eCollection 2021.
10
DandD: Efficient measurement of sequence growth and similarity.
iScience. 2024 Feb 1;27(3):109054. doi: 10.1016/j.isci.2024.109054. eCollection 2024 Mar 15.

Cited by

1
Sequence-based prioritization of i-Motif candidates in the human genome.
Front Bioinform. 2025 Aug 12;5:1657841. doi: 10.3389/fbinf.2025.1657841. eCollection 2025.
2
Pattern-based quantum text watermarking: Securing digital content with next-Gen quantum techniques.
iScience. 2024 Nov 12;27(12):111364. doi: 10.1016/j.isci.2024.111364. eCollection 2024 Dec 20.
3
Employing bimodal representations to predict DNA bendability within a self-supervised pre-trained framework.
Nucleic Acids Res. 2024 Apr 12;52(6):e33. doi: 10.1093/nar/gkae099.
4
SCInter: A comprehensive single-cell transcriptome integration database for human and mouse.
Comput Struct Biotechnol J. 2023 Nov 15;23:77-86. doi: 10.1016/j.csbj.2023.11.024. eCollection 2024 Dec.
5
LexicHash: sequence similarity estimation via lexicographic comparison of hashes.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad652.
6
Creating and Using Minimizer Sketches in Computational Genomics.
J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.
7
Integrating functional data analysis with case-based reasoning for hypertension prognosis and diagnosis based on real-world electronic health records.
BMC Med Inform Decis Mak. 2022 Jun 6;22(1):149. doi: 10.1186/s12911-022-01894-7.
8
Weighted minimizer sampling improves long read mapping.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i111-i118. doi: 10.1093/bioinformatics/btaa435.

References

1
Improved design and analysis of practical minimizers.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i119-i127. doi: 10.1093/bioinformatics/btaa472.
2
Locality-sensitive hashing for the edit distance.
Bioinformatics. 2019 Jul 15;35(14):i127-i135. doi: 10.1093/bioinformatics/btz354.
3
Skmer: assembly-free and alignment-free sample identification using genome skims.
Genome Biol. 2019 Feb 13;20(1):34. doi: 10.1186/s13059-019-1632-4.
4
Minimap2: pairwise alignment for nucleotide sequences.
Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191.
5
Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis.
F1000Res. 2017 Feb 3;6:100. doi: 10.12688/f1000research.10571.2. eCollection 2017.
6
HINGE: long-read assembly achieves optimal repeat resolution.
Genome Res. 2017 May;27(5):747-756. doi: 10.1101/gr.216465.116. Epub 2017 Mar 20.
7
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
Genome Res. 2017 May;27(5):722-736. doi: 10.1101/gr.215087.116. Epub 2017 Mar 15.
8
Fast and accurate de novo genome assembly from long uncorrected reads.
Genome Res. 2017 May;27(5):737-746. doi: 10.1101/gr.214270.116. Epub 2017 Jan 18.
9
Phased diploid genome assembly with single-molecule real-time sequencing.
Nat Methods. 2016 Dec;13(12):1050-1054. doi: 10.1038/nmeth.4035. Epub 2016 Oct 17.
10
Genome Skimming: A Rapid Approach to Gaining Diverse Biological Insights into Multicellular Pathogens.
PLoS Pathog. 2016 Aug 4;12(8):e1005713. doi: 10.1371/journal.ppat.1005713. eCollection 2016 Aug.