Sign language translation (SLT) is a key assistive technology in bridging communication barriers for Deaf and Hard-of-Hearing (DHH) individuals by converting visual-gestural language into spoken text. Sign languages have unique grammar and spatial structure (independent of spoken word order), and datasets have historically been limited in size and domain scope. In this paper, we present SignStreamNet, a novel hybrid architecture designed for streaming sign language video-to-text translation. Our model combines a slow 3D convolutional network with a fast Vision Transformer (Swin) to capture both temporal dynamics and spatial detail. By fusing slow-motion and fast visual features and employing a chunkwise streaming Transformer with Monotonic Chunkwise Attention (MoChA), our model can translate sign video to text in near real time. Experiments on the German Sign Language PHOENIX-2014T weather corpus and the Greek Sign Language (GSL) public service dialogues demonstrate strong BLEU and ROUGE performance, significantly advancing the state of the art on these tasks. Moreover, our model demonstrates the potential for accessible, low-latency sign language translation systems suitable for real-world deployment across diverse sign languages. This work opens the door toward live SLT systems that make spoken content accessible to DHH users.