CrossEarth:

Geospatial Vision Foundation Model for Domain Generalizable
Remote Sensing Semantic Segmentation

*Indicates Equal Contribution
1Sun Yat-sen University 2University of Science and Technology of China 3Wuhan University
4University of Tokyo 5KU Leuven 6KTH

Teaser for CrossEarth. Most existing remote sensing (RS) methods focus on domain adaptation (DA), which adapts models only to predefined target domains rather than enabling generalization to diverse unseen domains. Existing RS domain generalization (RSDG) methods, in turn, do not generalize well across varied cross-domain scenarios. We therefore propose CrossEarth, the first vision foundation model (VFM) designed for RSDG, capable of bridging diverse domain gaps and handling multiple segmentation scenarios.



Abstract

  • RSDG Importance: Remote Sensing Domain Generalization (RSDG) is a vital research area focused on building models that generalize across different scenarios in remote sensing (RS) images.
  • Challenges in RSDG:
      • DA Limitations: Current methods focus on adapting to predefined domains rather than unseen ones.
      • Lack of RSDG Studies: There is a scarcity of studies addressing RSDG, where models struggle with underfitting in unknown scenarios.
      • In-Domain Performance Overemphasis: Existing VFMs prioritize performance within a domain over cross-domain generalization.
  • CrossEarth: We introduce CrossEarth, the first vision foundation model for RSDG semantic segmentation.
  • CrossEarth Advantages:
      • Data-Level Earth-Style Injection pipeline.
      • Model-Level Multi-Task Training pipeline.
  • RSDG Benchmark: We have created an RSDG benchmark with 28 cross-domain settings to test the generalizability of future models.
  • CrossEarth Performance: Extensive experiments show CrossEarth outperforms current state-of-the-art methods in RSDG semantic segmentation.
    Benchmark Collection

    We categorize all benchmark datasets based on the domain gaps between the source and unseen domain datasets.

    Methods

    Does CrossEarth generalize well? How?


    (a) High-quality Representative Features

    CrossEarth extracts features that cluster closely for the same class across different domains, forming well-defined groups in feature space. Moreover, CrossEarth features exhibit high inter-class separability, forming unique clusters for each class.
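The two properties described above (tight same-class clusters across domains, well-separated clusters between classes) can be quantified in a simple way. The following is a minimal sketch, not the evaluation used by the authors: the function names `class_centroids` and `class_separation` are hypothetical, features from all domains are assumed to be pooled into one list, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import defaultdict


def class_centroids(features, labels):
    """Average the feature vectors of each class, pooled over all domains."""
    groups = defaultdict(list)
    for vec, label in zip(features, labels):
        groups[label].append(vec)
    return {
        label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
        for label, vecs in groups.items()
    }


def class_separation(features, labels):
    """Return (mean distance to own class centroid, smallest centroid-to-centroid distance).

    A small first value with a large second value indicates compact,
    well-separated class clusters. Requires at least two classes.
    """
    centroids = class_centroids(features, labels)
    intra = sum(
        math.dist(vec, centroids[label]) for vec, label in zip(features, labels)
    ) / len(features)
    names = list(centroids)
    inter = min(
        math.dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return intra, inter
```

Calling `class_separation` on features extracted from source-domain and unseen-domain images together gives one number for compactness and one for separation, which is one way to compare feature extractors on the same data.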



    (b) Data Manipulation + Representation Learning

    CrossEarth uses an Earth-Style Injection pipeline to create stylized and masked images, which are then processed by a Multi-Task Training pipeline for semantic segmentation and masked image modeling (MIM).
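To make the two-task idea concrete, the sketch below combines a segmentation loss with a masked image modeling loss on a toy single-channel image. It is an illustration under stated assumptions, not the CrossEarth implementation: the patch size, mask ratio, loss forms (pixel-wise cross-entropy and mean squared error on masked pixels), and the `mim_weight` parameter are all assumptions, every function name is hypothetical, and the style-injection step is not modeled here.

```python
import math
import random


def random_patch_mask(height, width, patch, ratio, rng):
    """Return a height x width grid of 0/1 values; 1 marks a masked pixel."""
    rows, cols = height // patch, width // patch
    n_masked = round(rows * cols * ratio)
    chosen = rng.sample(range(rows * cols), n_masked)
    mask = [[0] * width for _ in range(height)]
    for idx in chosen:
        r0, c0 = (idx // cols) * patch, (idx % cols) * patch
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                mask[r][c] = 1
    return mask


def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy; probs[r][c] lists the class probabilities."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            total -= math.log(max(p[y], 1e-12))
            count += 1
    return total / count


def masked_mse(recon, target, mask):
    """Mean squared reconstruction error over masked pixels only."""
    total, count = 0.0, 0
    for r, row in enumerate(mask):
        for c, is_masked in enumerate(row):
            if is_masked:
                total += (recon[r][c] - target[r][c]) ** 2
                count += 1
    return total / count if count else 0.0


def multi_task_loss(seg_probs, labels, recon, image, mask, mim_weight=1.0):
    """Segmentation loss plus weighted MIM loss (the weighting is an assumption)."""
    return cross_entropy(seg_probs, labels) + mim_weight * masked_mse(recon, image, mask)
```

In a real training loop the segmentation probabilities and the reconstruction would come from network heads sharing one backbone, so that both terms shape the same representation.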



    Refer to the paper (PDF) for more technical details on CrossEarth.

    Quantitative Performance

    Comparison on the Potsdam and Vaihingen Benchmarks


    Refer to the paper (PDF) for more details on ablation studies and references.

    Qualitative Results

    BibTeX

    @article{crossearth,
          title={CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation},
          author={Gong, Ziyang and Wei, Zhixiang and Wang, Di and Ma, Xianzheng and Chen, Hongruixuan and Jia, Yuru and Deng, Yupeng and Ji, Zhenming and Zhu, Xiangwei and Yokoya, Naoto and Zhang, Jing and Du, Bo and Zhang, Liangpei},
          journal={arXiv preprint arXiv:2410.22629},
          year={2024}
        }