Wang Di, Hu Meiqi, Jin Yao, Miao Yuchun, Yang Jiaqi, Xu Yichu, Qin Xiaolei, Ma Jiaqi, Sun Lingyu, Li Chenxing, Fu Chuan, Chen Hongruixuan, Han Chengxi, Yokoya Naoto, Zhang Jing, Xu Minqiang, Liu Lin, Zhang Lefei, Wu Chen, Du Bo, Tao Dacheng, Zhang Liangpei
IEEE Trans Pattern Anal Mach Intell. 2025 Aug;47(8):6427-6444. doi: 10.1109/TPAMI.2025.3557581.
Accurate hyperspectral image (HSI) interpretation is critical for Earth observation applications such as urban planning, precision agriculture, and environmental monitoring. However, existing HSI processing methods are predominantly task-specific and scene-dependent, which severely limits their ability to transfer knowledge across tasks and scenes and reduces their practicality in real-world applications. To address these challenges, we present HyperSIGMA, a vision transformer-based foundation model that unifies HSI interpretation across tasks and scenes and scales to over one billion parameters. To overcome the spectral and spatial redundancy inherent in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training; it contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transfer capability, real-world applicability, and computational efficiency.
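The abstract names sparse sampling attention but does not specify how positions are sampled, so the following is only a rough sketch of the general idea: each query attends to a small sampled subset of tokens rather than to all of them. The function name `sparse_sampled_attention`, the uniform random sampling, and the use of the raw token vectors as queries, keys, and values (no learned projections) are all assumptions made for illustration, not the authors' SSA implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_sampled_attention(tokens, num_samples, seed=0):
    """Let each query attend to `num_samples` sampled tokens, not all tokens.

    tokens: list of equal-length feature vectors (lists of floats).
    Sampling is uniform at random here; this is an illustrative assumption,
    since the abstract does not describe the sampling rule.
    """
    rng = random.Random(seed)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    num_samples = min(num_samples, len(tokens))
    outputs = []
    for query in tokens:
        # Pick a small subset of positions to serve as keys and values.
        picked = rng.sample(range(len(tokens)), num_samples)
        # Scaled dot-product scores against the sampled keys only.
        scores = [
            scale * sum(q * k for q, k in zip(query, tokens[i]))
            for i in picked
        ]
        weights = softmax(scores)
        # Weighted sum of the sampled value vectors.
        out = [
            sum(w * tokens[i][d] for w, i in zip(weights, picked))
            for d in range(dim)
        ]
        outputs.append(out)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
attended = sparse_sampled_attention(tokens, num_samples=2)
```

With `num_samples` fixed, the cost per query no longer grows with the number of tokens, which is the usual motivation for sampling when neighboring pixels and adjacent spectral bands carry largely redundant information.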