Hopper: a mathematically optimal algorithm for sketching biological data.

Author information

Department of Bioinformatics, Harvard University, Cambridge, MA 02138, USA.

Computer Science and Artificial Intelligence Laboratory.

Publication information

Bioinformatics. 2020 Jul 1;36(Suppl_1):i236-i241. doi: 10.1093/bioinformatics/btaa408.

DOI: 10.1093/bioinformatics/btaa408

PMID: 32657375

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7355272/

Abstract

MOTIVATION

Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today's largest datasets. In addition, current methods often favor common cell types, and miss salient biological features captured by small cell populations.

RESULTS

Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses. In a dataset of over 1.3 million mouse brain cells, Hopper detects a cluster of just 64 macrophages expressing inflammatory genes (0.004% of the full dataset) from a Hopper sketch containing just 5000 cells, and several other small but biologically interesting immune cell populations invisible to analysis of the full data. On an even larger dataset consisting of ∼2 million developing mouse organ cells, we show Hopper's even representation of important cell types in small sketches, in contrast with prior sketching methods. We also introduce Treehopper, which uses spatial partitioning to speed up Hopper by orders of magnitude with minimal loss in performance. By condensing transcriptional information encoded in large datasets, Hopper and Treehopper grant the individual user with a laptop the analytic capabilities of a large consortium.
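The iterative sampling described above can be illustrated with a minimal sketch. It assumes the procedure is a farthest-first traversal, the classic greedy approach that gives a 2-approximation of the best achievable Hausdorff distance for a fixed sketch size. The function name `farthest_first_sketch` and its parameters are hypothetical and are not taken from the Hopper codebase; this is an illustration of the idea, not the authors' implementation.

```python
import math


def farthest_first_sketch(points, k, start=0):
    """Greedy farthest-first traversal over a list of coordinate tuples.

    Returns (indices, radius): the indices of the k chosen points and the
    largest distance from any point to its nearest chosen point. Because
    the sketch is a subset of the data, this radius equals the Hausdorff
    distance between the full set and the sketch.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # nearest[i] = distance from point i to its closest sketch point so far
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Add the point that is currently worst represented by the sketch.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [
            min(d, math.dist(p, points[nxt]))
            for d, p in zip(nearest, points)
        ]
    return chosen, max(nearest)


if __name__ == "__main__":
    # Toy data: two nearby pairs plus one isolated point.
    cells = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
    indices, radius = farthest_first_sketch(cells, k=3)
    print("sketch indices:", indices)
    print("Hausdorff distance to full set:", radius)
```

Because each step adds the point farthest from the current sketch, isolated points (the analogue of rare cell populations) tend to be picked early, and the sketch can be extended later by continuing the loop from the stored `nearest` distances rather than starting over.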

AVAILABILITY AND IMPLEMENTATION

The code for Hopper is available at https://github.com/bendemeo/hopper. In addition, we have provided sketches of many of the largest single-cell datasets, available at http://hopper.csail.mit.edu.
