Abstract: Video question answering has become a cornerstone task for evaluating vision-language models. However, existing models often fail to ground their answers in relevant visual evidence or ...
Abstract: This work proposes a novel hybrid deep learning model that combines CNNs, Transformer models, and GRU layers, further extended with a Mixture-of-Experts (MoE) model, for advanced sequence analysis. The ...