

Poster

Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement

Xin Shen · Xinyu Wang · Lei Shen · Kaihao Zhang · Xin Yu


Abstract:

Cross-view isolated sign language recognition (CV-ISLR) addresses the challenge of identifying isolated signs from viewpoints unseen during training, a problem aggravated by the scarcity of multi-view data in existing benchmarks. To bridge this gap, we introduce a novel two-stage framework comprising View Synthesis and Contrastive Multi-task View-Semantics Recognition. In the View Synthesis stage, we simulate unseen viewpoints by extracting 3D keypoints from the frontal-view training data and synthesizing common-view 2D skeleton sequences via virtual camera rotation, which enriches view diversity without the cost of multi-camera setups. However, training directly on these synthetic samples yields only limited improvement, because viewpoint-specific and semantics-specific features remain entangled. To overcome this drawback, the Contrastive Multi-task View-Semantics Recognition stage employs a cross-attention mechanism and a contrastive learning objective to explicitly disentangle viewpoint-related information from sign semantics, thereby obtaining robust view-invariant representations. We evaluate our approach on the MM-WLAuslan dataset, the first benchmark for CV-ISLR, and on our extended protocol (MTV-Test), which includes additional multi-view data captured in the wild. Experimental results demonstrate that our method not only improves the accuracy of frontal-view skeleton-based isolated sign language recognition but also generalizes better to novel viewpoints. The MTV-Test set and code will be publicly released.
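
The View Synthesis stage described above amounts to rotating extracted 3D keypoints with a virtual camera and re-projecting them to 2D. The sketch below illustrates this idea under simplifying assumptions (rotation about the vertical axis only, orthographic projection); the function name, joint count, and yaw angles are illustrative placeholders, not the authors' released implementation.

```python
# Minimal sketch of view synthesis via virtual camera rotation.
# All names and parameter choices here are assumptions for illustration.
import numpy as np

def rotate_and_project(keypoints_3d: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate a (T, J, 3) keypoint sequence about the vertical (y) axis
    by yaw_deg degrees, then orthographically project onto the x-y plane,
    yielding a (T, J, 2) skeleton sequence from the virtual viewpoint."""
    theta = np.deg2rad(yaw_deg)
    # Rotation about the y (up) axis simulates orbiting the camera
    # around the signer.
    rot_y = np.array([
        [np.cos(theta), 0.0, np.sin(theta)],
        [0.0,           1.0, 0.0],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    rotated = keypoints_3d @ rot_y.T   # (T, J, 3) in the rotated frame
    return rotated[..., :2]            # orthographic projection to 2D

# Example: synthesize skeletons at several common viewpoints from a
# frontal-view sequence of T = 32 frames and J = 21 joints (placeholders).
frontal = np.random.randn(32, 21, 3)
synthetic_views = {yaw: rotate_and_project(frontal, yaw)
                   for yaw in (-60, -30, 0, 30, 60)}
```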

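Likewise, the Contrastive Multi-task View-Semantics Recognition stage can be pictured as two learnable query tokens that cross-attend to frame features, pooling a viewpoint representation and a semantic representation that are each supervised by their own classification head, while an InfoNCE-style contrastive term encourages semantic features of the same sign under different synthetic viewpoints to agree. The PyTorch sketch below is one plausible instantiation; all module names, dimensions, and loss details are assumptions, not the paper's actual architecture.

```python
# Minimal sketch: cross-attention pooling of view/semantic features plus
# a contrastive view-invariance loss. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewSemanticsDisentangler(nn.Module):
    def __init__(self, dim=256, n_views=5, n_signs=1000, n_heads=4):
        # n_views and n_signs are placeholder label counts.
        super().__init__()
        self.view_query = nn.Parameter(torch.randn(1, 1, dim))
        self.sem_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.view_head = nn.Linear(dim, n_views)  # multi-task: viewpoint label
        self.sign_head = nn.Linear(dim, n_signs)  # multi-task: sign label

    def forward(self, frame_feats):
        # frame_feats: (B, T, dim) per-frame skeleton features.
        B = frame_feats.size(0)
        queries = torch.cat([self.view_query, self.sem_query], dim=1)
        queries = queries.expand(B, -1, -1)        # (B, 2, dim)
        pooled, _ = self.attn(queries, frame_feats, frame_feats)
        view_feat, sem_feat = pooled[:, 0], pooled[:, 1]
        return (view_feat, sem_feat,
                self.view_head(view_feat), self.sign_head(sem_feat))

def contrastive_view_invariance(sem_a, sem_b, temperature=0.07):
    """InfoNCE between semantic features of the same signs rendered from
    two different viewpoints: matching rows are positives, the rest of
    the batch serve as negatives."""
    za, zb = F.normalize(sem_a, dim=1), F.normalize(sem_b, dim=1)
    logits = za @ zb.t() / temperature
    targets = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, targets)
```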