
Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.

Author information

Zhao Jianshu, Pierre Both Jean, Konstantinidis Konstantinos T

Affiliations

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, 225 North Avenue NW, Atlanta, GA, 30332, USA.

School of Biological Sciences, Georgia Institute of Technology, 225 North Avenue NW, Atlanta, GA, 30332, USA.

Publication information

NAR Genom Bioinform. 2024 Dec 18;6(4):lqae172. doi: 10.1093/nargab/lqae172. eCollection 2024 Dec.

DOI: 10.1093/nargab/lqae172
PMID: 39703432
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11655291/
Abstract

Dimension reduction (DR or embedding) algorithms such as t-SNE and UMAP have many applications in big data visualization but remain slow for large datasets. Here, we further improve the UMAP-like algorithms by (i) combining several aspects of t-SNE and UMAP to create a new DR algorithm; (ii) replacing its rate-limiting step, the K-nearest neighbor graph (K-NNG), with a Hierarchical Navigable Small World (HNSW) graph; and (iii) extending the functionality to DNA/RNA sequence data by combining HNSW with locality sensitive hashing algorithms (e.g. MinHash) for distance estimations among sequences. We also provide additional features including computation of local intrinsic dimension and hubness, which can reflect structures and properties of the underlying data that strongly affect the K-NNG accuracy, and thus the quality of the resulting embeddings. Our library, called annembed, is implemented, and fully parallelized in Rust and shows competitive accuracy compared to the popular UMAP-like algorithms. Additionally, we showcase the usefulness and scalability of our library with three real-world examples: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data and metagenomic contig (or population) binning. Therefore, annembed can facilitate DR for several tasks for biological data analysis where distance computation is expensive or when there are millions to billions of data points to process.
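
To make point (iii) concrete, the following is a minimal sketch of how a MinHash-style sketch can estimate a distance between two DNA sequences without aligning them. It is an illustration only and is not the annembed API: the function names (`kmers`, `bottom_sketch`, `minhash_distance`), the k-mer length, the sketch size, and the choice of BLAKE2b as the hash are all assumptions made for this example. The distances it produces are the kind of input that a nearest-neighbor graph such as HNSW could then be built on.

```python
import hashlib


def kmers(seq, k):
    """Return the set of k-mers (length-k substrings) of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash64(kmer):
    """Stable 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=8, size=64):
    """Bottom-s MinHash sketch: the `size` smallest k-mer hash values."""
    return sorted({_hash64(km) for km in kmers(seq, k)})[:size]


def minhash_jaccard(sketch_a, sketch_b, size):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches.

    The `size` smallest hashes of the merged sketches are a uniform sample
    of the union; the fraction found in both sketches estimates Jaccard.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def minhash_distance(seq_a, seq_b, k=8, size=64):
    """Jaccard distance (1 - estimated similarity) between two sequences."""
    similarity = minhash_jaccard(
        bottom_sketch(seq_a, k, size), bottom_sketch(seq_b, k, size), size
    )
    return 1.0 - similarity
```

Each sequence is reduced to a fixed-size sketch once, so comparing two sequences costs time proportional to the sketch size rather than to the sequence length, which is what makes this family of estimators attractive when distance computation is the expensive step.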

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9435/11655291/55e8cabd8b3e/lqae172fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9435/11655291/97466375c0d0/lqae172fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9435/11655291/e8288fb64d15/lqae172fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9435/11655291/1cfa6ecb2270/lqae172fig4.jpg

Similar articles

1
Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.
NAR Genom Bioinform. 2024 Dec 18;6(4):lqae172. doi: 10.1093/nargab/lqae172. eCollection 2024 Dec.
2
GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.
Nucleic Acids Res. 2024 Sep 9;52(16):e74. doi: 10.1093/nar/gkae609.
3
Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters.
Nat Commun. 2024 Feb 26;15(1):1753. doi: 10.1038/s41467-024-45891-y.
4
Shape-aware stochastic neighbor embedding for robust data visualisations.
BMC Bioinformatics. 2022 Nov 14;23(1):477. doi: 10.1186/s12859-022-05028-8.
5
DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data.
PLoS Comput Biol. 2022 Apr 11;18(4):e1008885. doi: 10.1371/journal.pcbi.1008885. eCollection 2022 Apr.
6
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.
IEEE Trans Pattern Anal Mach Intell. 2020 Apr;42(4):824-836. doi: 10.1109/TPAMI.2018.2889473. Epub 2018 Dec 28.
7
Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data.
Cell Rep. 2021 Jul 27;36(4):109442. doi: 10.1016/j.celrep.2021.109442.
8
UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets.
Comput Biol Med. 2021 Apr;131:104264. doi: 10.1016/j.compbiomed.2021.104264. Epub 2021 Feb 22.
9
scDEED: a statistical method for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters.
bioRxiv. 2023 Sep 15:2023.04.21.537839. doi: 10.1101/2023.04.21.537839.
10
K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis.
Comput Biol Med. 2024 Jun;175:108497. doi: 10.1016/j.compbiomed.2024.108497. Epub 2024 Apr 24.
