SignStreamNet: Streaming Sign Language Video-to-Text Translation for Accessibility

Ahmed, W.

doi:10.1145/3663547.3759422

2025

Journal Article

Abstract

Sign language translation (SLT) is a key assistive technology for bridging communication barriers for Deaf and Hard-of-Hearing (DHH) individuals, converting a visual-gestural language into spoken-language text. Sign languages have their own grammar and spatial structure, independent of spoken word order, and SLT datasets have historically been limited in size and domain scope. In this paper, we present SignStreamNet, a novel hybrid architecture designed for streaming sign language video-to-text translation. Our model combines a slow 3D convolutional network with a fast Vision Transformer (Swin) to capture both temporal dynamics and spatial detail. By fusing the slow-motion and fast visual features and employing a chunkwise streaming Transformer with Monotonic Chunkwise Attention (MoChA), our model can translate sign video to text in near real time. Experiments on the German Sign Language PHOENIX-2014T weather corpus and the Greek Sign Language (GSL) public-service dialogues demonstrate strong BLEU and ROUGE performance, significantly advancing the state of the art on these tasks. Moreover, our model shows the potential for accessible, low-latency sign language translation systems suitable for real-world deployment across diverse sign languages. This work opens the door toward live SLT systems that make spoken content accessible to DHH users.
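Monotonic Chunkwise Attention is the component that lets a decoder emit text before the whole video has been seen. At each output step, attention scans forward from the frame it last attended to and stops at the first frame whose selection probability reaches 0.5. It then applies ordinary softmax attention over a small chunk of frames ending there. The following is a minimal, hypothetical sketch of that standard test-time procedure, not SignStreamNet's implementation. It makes several assumptions:

- The helper names `mocha_step` and `mocha_decode` and the scalar `bias` are illustrative.
- Both energy functions are plain dot products. Real MoChA learns separate monotonic and chunk energy functions.
- The encoder states are short lists of floats standing in for the fused slow/fast visual features.
- The decoder queries are fixed rather than produced by a decoder.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mocha_step(
    query: Vector,
    encoder_states: Sequence[Vector],
    prev_index: int,
    chunk_size: int,
    bias: float = 0.0,
    threshold: float = 0.5,
) -> Tuple[List[float], Optional[int]]:
    """One test-time MoChA step (hypothetical helper, not the paper's code).

    Scans forward from prev_index and stops at the first frame whose selection
    probability reaches `threshold`, then applies softmax attention over the
    `chunk_size` frames ending at that frame. Returns (context, index), with a
    zero context and index None if no frame is selected.
    """
    dim = len(query)
    for j in range(prev_index, len(encoder_states)):
        # Assumption: monotonic energy = dot product + scalar bias.
        if _sigmoid(_dot(query, encoder_states[j]) + bias) >= threshold:
            chunk = encoder_states[max(0, j - chunk_size + 1): j + 1]
            # Assumption: chunk energy = plain dot product.
            energies = [_dot(query, h) for h in chunk]
            peak = max(energies)  # subtract the max for numerical stability
            weights = [math.exp(e - peak) for e in energies]
            total = sum(weights)
            context = [
                sum(w * h[d] for w, h in zip(weights, chunk)) / total
                for d in range(dim)
            ]
            return context, j
    return [0.0] * dim, None


def mocha_decode(
    queries: Sequence[Vector],
    encoder_states: Sequence[Vector],
    chunk_size: int,
    bias: float = 0.0,
) -> Tuple[List[List[float]], List[Optional[int]]]:
    """Runs mocha_step per query, carrying the attended index forward."""
    index = 0
    contexts: List[List[float]] = []
    indices: List[Optional[int]] = []
    for query in queries:
        context, t = mocha_step(query, encoder_states, index, chunk_size, bias)
        contexts.append(context)
        indices.append(t)
        # Attention never moves backwards; with no selection, nothing remains.
        index = t if t is not None else len(encoder_states)
    return contexts, indices


frames = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
contexts, indices = mocha_decode(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], frames, chunk_size=2, bias=-2.0
)
```

Here the attended indices come out as `[2, 3, None]`. Each step only reads frames up to the frame it selects, so it never needs future video, which is what makes chunkwise decoding compatible with streaming. The sketch treats the input as complete. In a live decoder, running off the end of the frames buffered so far would more plausibly mean waiting for more video than emitting a zero context.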