Sign language translation (SLT) is a key assistive technology in bridging communication barriers for Deaf and Hard-of-Hearing (DHH) individuals by converting visual-gestural language into spoken text. Sign languages have unique grammar and spatial structure (independent of spoken word order), and datasets have historically been limited in size and domain scope. In this paper, we present SignStreamNet, a novel hybrid architecture designed for streaming sign language video-to-text translation. Our model combines a slow 3D convolutional network with a fast Vision Transformer (Swin) to capture both temporal dynamics and spatial detail. By fusing slow-motion and fast visual features and employing a chunkwise streaming Transformer with Monotonic Chunkwise Attention (MoChA), our model can translate sign video to text in near real time. Experiments on the German Sign Language PHOENIX-2014T weather corpus and the Greek Sign Language (GSL) public service dialogues demonstrate strong BLEU and ROUGE performance, significantly advancing the state of the art on these tasks. Moreover, our model demonstrates the potential for accessible, low-latency sign language translation systems suitable for real-world deployment across diverse sign languages. This work opens the door toward live SLT systems that make spoken content accessible to DHH users.