Abstract: Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are ...
Abstract: We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current ...