MM-Vid: Video Understanding with GPT-4V(ision). MM-Vid leverages multimodal large language models for comprehensive video analysis.
It is an early exploration of using GPT-4V's vision capabilities for end-to-end video understanding tasks.