Comprehensive highway scene understanding and robust traffic risk inference are critical to advancing Intelligent Transportation Systems (ITS) and autonomous driving. However, traditional methods often face limitations in scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we present a novel structured prompting and knowledge distillation approach that enables the automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our method orchestrates two large Vision-Language Models (VLMs), ChatGPT-4o and o3-mini, through a structured Chain-of-Thought (CoT) prompting strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a smaller student VLM. The resulting 3B-parameter model, VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of interpreting low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced size, VISTA demonstrates strong performance on standard captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when compared to its teacher models. This work shows that effective knowledge distillation and structured multi-agent supervision can equip compact VLMs with advanced reasoning capabilities. VISTA’s lightweight architecture supports efficient deployment on edge devices, enabling real-time risk monitoring without the need for extensive infrastructure upgrades.
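To make the pipeline concrete, the following is a minimal sketch of how a structured CoT prompt and a teacher-derived pseudo-annotation record might be assembled for student fine-tuning. The stage wording, helper names (`build_cot_prompt`, `PseudoAnnotation`, `to_sft_record`), and mocked teacher outputs are illustrative assumptions, not the paper's actual prompts or data; the real system calls the two teacher VLMs and applies its own quality controls.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical stages of the structured Chain-of-Thought prompt; the actual
# prompt design is described in the linked paper, not reproduced here.
COT_STAGES = [
    "Describe the overall highway scene (weather, lighting, traffic density).",
    "List visible agents and any hazardous behaviors or conditions.",
    "Infer the contextual risk level and justify it step by step.",
    "Write a concise, risk-aware caption for the video clip.",
]

def build_cot_prompt(clip_id: str) -> str:
    """Assemble a numbered multi-stage prompt for one video clip."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(COT_STAGES, 1))
    return f"Video clip: {clip_id}\nAnswer the following stages in order:\n{steps}"

@dataclass
class PseudoAnnotation:
    """One supervised fine-tuning example distilled from the teacher VLMs."""
    clip_id: str
    prompt: str
    teacher_outputs: Dict[str, str] = field(default_factory=dict)

    def to_sft_record(self) -> Dict[str, str]:
        # Merge the teachers' answers into a single training target; a real
        # pipeline would score and filter them before training the student.
        target = "\n\n".join(
            f"[{name}]\n{text}" for name, text in self.teacher_outputs.items()
        )
        return {"input": self.prompt, "target": target}

# Usage: two (mocked) teacher responses become one fine-tuning record.
ann = PseudoAnnotation(clip_id="clip_0001", prompt=build_cot_prompt("clip_0001"))
ann.teacher_outputs["gpt-4o"] = "Heavy rain, dense traffic; high rear-end collision risk."
ann.teacher_outputs["o3-mini"] = "Wet road surface, short headways; elevated risk."
record = ann.to_sft_record()
```

The multi-perspective target keeps both teachers' views visible to the student during fine-tuning; collapsing them into one consensus caption is an alternative design choice with different trade-offs.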
Link to the tool: https://github.com/SMIL-AI/VISTA
Related Publication: Y. Yang, N. Xu, and J. Yang, “Structured Prompting and Knowledge Distillation for Traffic Video Interpretation and Risk Inference,” submitted to 2026 TRB Annual Meeting (Paper Number: TRBAM-26-02440).