

Poster

RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction

Baojie Fan · Xiaotian Li · Yuhan Zhou · Yuyu Jiang · Jiandong Tian · Huijie Fan


Abstract:

The multi-modal 3D semantic occupancy task provides a comprehensive understanding of the scene and has received considerable attention in the field of autonomous driving. However, existing methods mainly focus on processing large-scale voxels, which brings high computational costs and degrades details. Additionally, they struggle to accurately capture occluded targets and distant information. In this paper, we propose a novel LiDAR-Camera 3D semantic occupancy prediction framework called RIOcc, with collaborative feature refinement and a multi-scale cross-modal fusion transformer. Specifically, RIOcc encodes multi-modal data into a unified Bird's Eye View (BEV) space, which reduces computational complexity and enhances the efficiency of feature alignment. Then, multi-scale feature processing substantially expands the receptive fields. Meanwhile, in the LiDAR branch, we design the Dual-branch Pooling (DBP) to adaptively enhance geometric features across both the Channel and Grid dimensions. In the camera branch, the Wavelet and Semantic Encoders are developed to extract high-level semantic features with abundant edge and structural information. Finally, to facilitate effective cross-modal complementarity, we develop the Deformable Dual-Attention (DDA) module. Extensive experiments demonstrate that RIOcc achieves state-of-the-art performance, with 54.2 mIoU and 25.9 mIoU on the Occ3D-nuScenes and nuScenes-Occupancy datasets, respectively.
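To make the channel-and-grid refinement idea more concrete, below is a minimal PyTorch sketch of a dual-branch pooling block in the spirit of the DBP module described above. The abstract does not specify the module's internals, so the layer choices here (a squeeze-and-excitation-style channel branch and a mean/max channel-pooled grid branch) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBranchPooling(nn.Module):
    """Hypothetical sketch of a dual-branch pooling block.

    One branch pools over the BEV grid to produce per-channel weights;
    the other pools over channels to produce per-grid-cell weights.
    Both reweightings are applied to the input BEV feature map.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel branch: global average pool over the H x W grid,
        # then a bottleneck MLP (squeeze-and-excitation style).
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Grid branch: pool over channels (mean and max), then a small
        # convolution produces a per-cell attention map.
        self.grid_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) LiDAR BEV features
        b, c, _, _ = x.shape
        # Channel attention from grid-pooled statistics.
        chan_w = self.channel_mlp(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        # Grid attention from channel-pooled statistics.
        grid_stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        grid_w = self.grid_conv(grid_stats)
        # Apply both reweightings to the input features.
        return x * chan_w * grid_w

if __name__ == "__main__":
    bev = torch.randn(2, 64, 128, 128)   # toy LiDAR BEV feature map
    out = DualBranchPooling(64)(bev)
    print(out.shape)                      # torch.Size([2, 64, 128, 128])
```

The two branches are complementary: the channel branch decides which feature maps matter globally, while the grid branch decides which BEV cells matter spatially, mirroring the "Channel and Grid dimensions" mentioned in the abstract.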
