Graikos Alexandros, Yellapragada Srikar, Le Minh-Quan, Kapse Saarthak, Prasanna Prateek, Saltz Joel, Samaras Dimitris
Stony Brook University.
Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Jun;2024:8532-8542. doi: 10.1109/cvpr52733.2024.00815. Epub 2024 Sep 16.
To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, procuring the painstaking patch-level annotations these models require is impractical in specialized domains like histopathology and satellite imagery: annotation must be performed by domain experts and can involve hundreds of millions of patches. Modern self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies for fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on SSL embeddings. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings: the SSL embeddings used to generate a large image can either be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g., class labels, text, genomic data). As proof of concept, we introduce the text-to-large-image synthesis paradigm, successfully synthesizing large pathology and satellite images from text descriptions.
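The core idea of conditioning a diffusion model on an SSL embedding instead of a human label can be sketched in a few lines. The following is a minimal toy illustration, not the paper's implementation: `cond_denoiser` is a hypothetical stand-in for the conditioned denoising network, the linear schedule and epsilon-prediction loss are standard DDPM conventions, and the embedding is random noise standing in for a real SSL feature (e.g., from a DINO-style encoder).

```python
import numpy as np

def make_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule; returns cumulative products alpha-bar_t."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process q(x_t | x_0): blend the clean patch with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def cond_denoiser(xt, t, ssl_emb, W):
    """Hypothetical conditional noise predictor: a single linear map over the
    noisy patch, the SSL embedding, and the timestep (a stand-in for the
    embedding-conditioned diffusion U-Net described in the abstract)."""
    feats = np.concatenate([xt.ravel(), ssl_emb, [float(t)]])
    return (W @ feats).reshape(xt.shape)

rng = np.random.default_rng(0)
alpha_bar = make_schedule()
x0 = rng.standard_normal((8, 8))        # toy image "patch"
ssl_emb = rng.standard_normal(16)       # toy SSL embedding (proxy for a label)
xt, eps = add_noise(x0, 50, alpha_bar, rng)
W = rng.standard_normal((64, 64 + 16 + 1)) * 0.01
eps_hat = cond_denoiser(xt, 50, ssl_emb, W)
loss = float(np.mean((eps_hat - eps) ** 2))  # standard epsilon-prediction loss
```

At training time, the SSL embedding of each real patch replaces the human annotation as the conditioning signal; at sampling time, the same interface accepts embeddings extracted from a reference image or drawn from an auxiliary model, which is what makes the approach agnostic to the embedding source.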