Multi-Frame Grid Perspective for Traffic Video Captioning and Context-Aware VQA

12 pages. Published: April 19, 2026

Abstract

"What really happened? Who was at fault? Did the pedestrian yield, or was the driver distracted?" In high-stakes traffic incidents, understanding pedestrian-vehicle interactions is essential for safety assessment, post-crash analysis, and insurance decision-making. We propose a vision-language framework for traffic safety captioning and visual question answering (VQA), designed for the AI City Challenge 2025. Using LLaVA-1.5 as our base vision-language model, we introduce a multi-frame collage input strategy that embeds temporal context into image-based architectures. We explore three input transformation techniques, Box Stitch, Blur Stitch, and Arrow Stitch, which emphasize entity localization, contextual filtering, and motion trajectory, respectively. Structured captions are generated in a two-stage process: LLaVA first extracts fine-grained semantic features via targeted question answering, and Mistral-7B then converts these features into narrative descriptions. For VQA, Mistral reasons over the same structured scene features to select the most contextually appropriate answer. Our best-performing configuration, Box Stitch, achieves an S_2 score of 33.93 on the official test set, demonstrating the effectiveness of structured prompting, modular caption pipelines, and strategic visual input augmentation for understanding pedestrian-vehicle interactions. This work highlights the promise of combining static visual backbones with image-based temporal fusion for traffic scenario comprehension.
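
The multi-frame collage idea can be illustrated with a minimal sketch. The abstract does not specify the grid layout, the number of frames, or the sampling scheme, so uniform sampling, row-major tiling, and the helper names `sample_indices` and `stitch_grid` are all assumptions made for illustration; the Box, Blur, and Arrow overlays are not reproduced here.

```python
from typing import List, Sequence

# Hypothetical frame type: one grayscale frame as rows of pixel values.
Frame = List[List[int]]


def sample_indices(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices spread evenly across a clip (assumed scheme)."""
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    if k == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def stitch_grid(frames: Sequence[Frame], cols: int, pad_value: int = 0) -> Frame:
    """Tile equally sized frames row-major into a single collage image.

    Time order runs left to right, then top to bottom. Unused cells in the
    last grid row are filled with pad_value.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if cols <= 0:
        raise ValueError("cols must be positive")
    height, width = len(frames[0]), len(frames[0][0])
    blank = [[pad_value] * width for _ in range(height)]
    tiles = list(frames)
    while len(tiles) % cols:
        tiles.append(blank)

    collage: Frame = []
    for start in range(0, len(tiles), cols):
        row_tiles = tiles[start:start + cols]
        for y in range(height):
            # Concatenate row y of every tile in this grid row.
            collage.append([px for tile in row_tiles for px in tile[y]])
    return collage


# Toy clip: nine 2x2 frames whose pixel value equals the frame index.
clip = [[[i, i], [i, i]] for i in range(9)]
picked = [clip[i] for i in sample_indices(len(clip), 3)]
collage = stitch_grid(picked, cols=2)
```

In this toy run the three sampled frames fill a 2x2 grid with one padded cell, giving a single still image that an image-only model could take as input in place of a video clip.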

Keyphrases: multi-view fusion, temporal context alignment, traffic scene understanding, video question answering, vision-language models

In: Jernej Masnec, Hamid Reza Karimian, Parisa Kordjamshidi and Yan Li (editors). Proceedings of AI for Accelerated Research Symposium, vol 3, pages 65-76.

BibTeX entry
@inproceedings{AIAS2025:Multi_Frame_Grid_Perspective,
  author    = {Sanjita Prajapati and Ashutosh Dumka and Rajan Thakulla and Atmadip Goswami and Karo Ahmadi Dehrashid and Anuj Sharma},
  title     = {Multi-Frame Grid Perspective for Traffic Video Captioning and Context-Aware VQA},
  booktitle = {Proceedings of AI for Accelerated Research Symposium},
  editor    = {Jernej Masnec and Hamid Reza Karimian and Parisa Kordjamshidi and Yan Li},
  series    = {EPiC Series in Technology},
  volume    = {3},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2516-2322},
  url       = {/publications/paper/VHQR},
  doi       = {10.29007/3hbg},
  pages     = {65-76},
  year      = {2026}}