We focus on the complexity and multi-scale importance of semantic segmentation of remote sensing images, a fundamental task in earth science research. We propose TransUNetFormer, a potent U-Net-like architecture that extensively integrates convolutional neural network (CNN) + transformer fusion in both the encoder and decoder; this fusion emphasizes the significance of global contextual information and local feature details. TransUNetFormer achieves superior generalization for remote sensing image segmentation, particularly in capturing multi-scale features within its encoder-decoder architecture. The encoder incorporates design principles inspired by TransUNet, leveraging a CNN + transformer component to form an efficient hybrid. In addition, a CNN + transformer hybrid block in the decoder, DP-hybrid, efficiently captures rich global-local features at each upsampling step. We introduce a fusion-concatenation module that dynamically generates weights during the interaction between the encoder and decoder, facilitating feature map fusion. Finally, an efficient feature refinement segmentation head is devised to purify shallow-stage encoder features and optimize the deepest global-local features in the decoder for the fused output. Experimental results on two widely used datasets, ISPRS Potsdam and LoveDA Urban, demonstrate the effectiveness and potential of TransUNetFormer. To our knowledge, this is the first hybrid CNN + transformer network specifically designed for remote sensing image segmentation.
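The abstract describes the architecture only at a high level. For illustration, the following is a minimal PyTorch sketch of a U-Net-style CNN + transformer hybrid with a dynamically weighted skip-connection fusion, in the spirit of the description above. Every class name here (HybridBlock, FusionConcat, TransUNetFormerSketch), the two-stage depth, and all hyperparameters are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: a toy U-Net-style CNN + transformer hybrid loosely
# following the abstract. Names, depths, and widths are assumptions, not the
# published TransUNetFormer architecture.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    """CNN + transformer hybrid stand-in: a 3x3 conv branch captures local
    detail, a self-attention branch captures global context; outputs are summed."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.conv(x)
        tokens = x.flatten(2).transpose(1, 2)        # (B, HW, C) token sequence
        glob, _ = self.attn(tokens, tokens, tokens)  # global self-attention
        glob = self.norm(glob + tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return local + glob                          # fuse local + global paths


class FusionConcat(nn.Module):
    """Fusion-concatenation stand-in: a softmax gate dynamically weights the
    encoder skip and decoder feature before concatenation and 1x1 projection."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, 1),
            nn.Softmax(dim=1),
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, skip, up):
        w = self.gate(torch.cat([skip, up], dim=1))  # (B, 2, 1, 1) fusion weights
        fused = torch.cat([skip * w[:, :1], up * w[:, 1:]], dim=1)
        return self.proj(fused)


class TransUNetFormerSketch(nn.Module):
    """Two-level toy version: hybrid encoder stages, a hybrid decoder block
    (a DP-hybrid stand-in), dynamic skip fusion, and a light refinement head."""

    def __init__(self, in_ch=3, base=32, num_classes=6):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, base, 3, padding=1)
        self.enc1 = HybridBlock(base)
        self.down = nn.Conv2d(base, 2 * base, 3, stride=2, padding=1)
        self.enc2 = HybridBlock(2 * base)
        self.up = nn.ConvTranspose2d(2 * base, base, 2, stride=2)
        self.dec1 = HybridBlock(base)                # hybrid block in the decoder
        self.fuse = FusionConcat(base)
        self.head = nn.Sequential(                   # simple refinement head
            nn.Conv2d(base, base, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, num_classes, 1),
        )

    def forward(self, x):
        s1 = self.enc1(self.stem(x))                 # shallow encoder stage
        s2 = self.enc2(self.down(s1))                # deep encoder stage
        d1 = self.dec1(self.up(s2))                  # hybrid decoding step
        return self.head(self.fuse(s1, d1))          # fuse skip, then predict


if __name__ == "__main__":
    model = TransUNetFormerSketch()
    logits = model(torch.randn(1, 3, 64, 64))
    print(logits.shape)                              # torch.Size([1, 6, 64, 64])
```

The parallel convolution and self-attention branches in HybridBlock stand in for the global-local fusion the abstract emphasizes, and the softmax gate in FusionConcat mirrors the idea of dynamically generated fusion weights between encoder and decoder; a faithful reproduction would follow the published architecture.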
Keywords: Image segmentation, Transformers, Remote sensing, Semantics, Convolutional neural networks, Feature fusion, Head