Video Instruction Tuning with Synthetic Data (2024)

Blog Series


Yuanhan Zhang, Jinming Wu‡, Wei Li♮, Bo Li, Zejun Ma♮
Ziwei Liu*, Chunyuan Li♮*

♮ByteDance, NTU S-Lab, ‡BUPT
Work done in collaboration with ByteDance    *Co-senior authors

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Click on the sections below to learn more about this project:

  1. §Video Instruction-Following Data Synthesis
  2. §Video Representation
  3. §Benchmark Performance
  4. §Interactive Demos

Video Instruction-Following Data Synthesis

A high-quality dataset for video instruction tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both the video content and its language annotations. We conduct a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.

Video Sources

We observe that although different video-language datasets focus on various video understanding tasks, most are drawn from ten main video sources, which together offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video sources and other datasets is shown in the figure below. We select dynamic videos from these sources; the video selection logic is detailed in the paper.

[Figure 1: The relationship between the ten selected video sources and existing video-language benchmarks.]

Automated Generation for Video Detail Description

For the selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (1 FPS). However, due to GPT-4o's input size constraints, we cannot use all sampled frames at once. Instead, we describe the videos sequentially, creating descriptions at three distinct levels, as shown in the figure below.

[Figure 2: The sequential pipeline for generating detailed video descriptions at three levels.]
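To make the 1 FPS sampling and segment-by-segment captioning concrete, here is a minimal sketch. It assumes OpenCV for decoding; `describe_segment` is a hypothetical stand-in for the GPT-4o call (left unimplemented here), and the actual prompts, segment lengths, and three-level description scheme are described in the paper.

```python
# Minimal sketch of 1 FPS frame sampling and sequential captioning.
# OpenCV is assumed for decoding; `describe_segment` is a hypothetical
# placeholder for the GPT-4o call and is not implemented here.
import cv2


def sample_frames_at_1fps(video_path: str):
    """Decode a video and keep roughly one frame per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(int(round(fps)), 1)            # keep every `step`-th frame ~= 1 FPS
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def describe_segment(segment, previous_description: str) -> str:
    """Hypothetical GPT-4o call: describe `segment` given the running description."""
    raise NotImplementedError


def caption_video(video_path: str, segment_len: int = 10) -> str:
    """Describe a long video segment by segment, carrying context forward."""
    frames = sample_frames_at_1fps(video_path)
    description = ""
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        description = describe_segment(segment, previous_description=description)
    return description
```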

Automated Generation for Video Question Answering

In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model's ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each question type. Please refer to the paper for more details on the question types and the generation process.

[Figure 3: The 16 question types used to generate question-answer pairs.]
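The sketch below illustrates how such a caption-conditioned QA-generation step might look, assuming the OpenAI Python SDK. The question-type names and the prompt wording are illustrative placeholders, not the paper's actual 16-category taxonomy or prompts.

```python
# A minimal sketch of caption-conditioned QA generation, assuming the OpenAI
# Python SDK. QUESTION_TYPES is illustrative only; the paper defines 16
# categories, and the actual prompts differ.
from openai import OpenAI

client = OpenAI()

QUESTION_TYPES = [          # placeholder names, not the paper's exact taxonomy
    "temporal order of events",
    "causal reasoning",
    "object attributes",
]


def generate_qa(detailed_caption: str, question_type: str) -> str:
    """Ask GPT-4o for at most one question-answer pair of the given type."""
    prompt = (
        "Below is a detailed description of a video.\n\n"
        f"{detailed_caption}\n\n"
        f"Write at most one question-answer pair of type '{question_type}'. "
        "If no sensible question of this type can be asked, reply 'SKIP'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Example usage (video_caption is a detailed description produced earlier):
# qa_pairs = [generate_qa(video_caption, t) for t in QUESTION_TYPES]
```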

Dataset Statistics

We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.

[Figures 4 and 5: Dataset statistics of LLaVA-Video-178K.]

Dataset Comparison

We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.

  1. A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video includes 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which consists of short clips cut from longer videos, suggesting simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.
  2. High frames per second. Regarding frame sampling for language annotations, the proposed dataset uses 1 FPS, while other datasets use much lower sampling rates. LLaVA-Hound uniformly samples 10 frames from videos of any length, giving an average of 0.008 FPS, which may miss fine details. ShareGPT4Video picks key frames with CLIP based on frame uniqueness; this method can also miss subtle changes in the video, because CLIP embeddings do not capture fine-grained dynamics well. Our method samples at 1 FPS without any key-frame selection algorithm, ensuring that detailed temporal information can be expressed in the annotations with high coverage.
  3. Diverse tasks. The proposed dataset covers three common task types, namely captioning, free-form QA, and closed-form QA, while existing datasets cover only a subset. Meanwhile, the quality and number of samples in our dataset are higher.

| Dataset | Text Annotator | #Videos | Total Video Length | Average FPS | #Captions | #OE QA | #MC QA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-Hound | GPT-4V | 900K | 3K hr | 0.008 | 900K | 900K | 0 |
| ShareGPT4Video | GPT-4V | 40K | 0.2K hr | 0.15 | 40K | 0 | 0 |
| LLaVA-Video-178K | GPT-4o | 178K | 2K hr | 1 | 178K | 960K | 196K |

Video Representation

Following the classic SlowFast idea in video representations, we develop \(\text{LLaVA-Video}_{~\mathtt{SlowFast}}\) to balance the number of frames against the number of visual tokens, within the budget of the limited LLM context window and the GPU memory available for video representation.

Specifically, we categorize the frames into two groups, based on the strike rate \(s\), where every \(s\) frames are uniformly selected to form the slow frame group, and the rest of the frames are considered as the fast frame group. Note that a special case \(s=1\) leads to only one group, reducing the SlowFast representation to the original simple representation. For each group, we apply different pooling rates using the PyTorch function \(\mathtt{avg\_pool2d}()\). We apply \(p \times p\) pooling and \(2p \times 2p\) pooling for slow and fast frames, respectively.

To summarize, we parameterize the video representation configuration as \(\mathcal{V} = (T, M, s, p)\), where \(T\) is the number of sampled frames, \(M\) is the number of visual tokens per frame, \(s\) is the strike rate, and \(p\) is the pooling size.

[Figure 6: The LLaVA-Video SlowFast video representation.]
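To make the grouping and pooling concrete, below is a minimal PyTorch sketch. It assumes per-frame visual features of shape (T, D, H, W); the shapes and example values are assumptions for illustration, not the exact LLaVA-Video implementation.

```python
# Minimal sketch of SlowFast token pooling over per-frame visual features,
# assumed to have shape (T, D, H, W) after the vision encoder.
import torch
import torch.nn.functional as F


def slowfast_pool(frame_feats: torch.Tensor, s: int, p: int):
    """Split frames into slow/fast groups by strike rate s and pool each group.

    frame_feats: (T, D, H, W) features, one grid of visual tokens per frame.
    s: strike rate; every s-th frame forms the slow group, the rest are fast.
    p: pooling size; slow frames get p x p pooling, fast frames 2p x 2p.
    With s = 1 the fast group is empty and only the slow tokens remain.
    """
    T = frame_feats.shape[0]
    slow_idx = torch.arange(0, T, s)              # every s-th frame
    fast_mask = torch.ones(T, dtype=torch.bool)
    fast_mask[slow_idx] = False

    slow = F.avg_pool2d(frame_feats[slow_idx], kernel_size=p)
    fast = F.avg_pool2d(frame_feats[fast_mask], kernel_size=2 * p)

    # Flatten each spatial grid into a token sequence: (n_frames, tokens, D)
    slow_tokens = slow.flatten(2).transpose(1, 2)
    fast_tokens = fast.flatten(2).transpose(1, 2)
    return slow_tokens, fast_tokens


# Example: 32 frames of 24x24 tokens with hidden size 1024, s=3, p=2
# slow, fast = slowfast_pool(torch.randn(32, 1024, 24, 24), s=3, p=2)
```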

Benchmark Performance

We fine-tune LLaVA-OneVision (SI) on a joint dataset of video and image data. Specifically, we add video data from LLaVA-Video-178K and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K, focusing on videos shorter than three minutes. These datasets contribute a total of 1.6 million video-language samples, including 193,510 video descriptions, 1,240,801 open-ended questions, and 215,625 multiple-choice questions. Remarkably, 92.2% of the video descriptions, 77.4% of the open-ended questions, and 90.9% of the multiple-choice questions are newly annotated. Additionally, we use 1.1 million image-language pairs from LLaVA-OneVision.

The benchmarks fall into three groups: captioning (VideoDC, Dream-1K), open-ended Q&A (ActNet-QA, VideoChatGPT), and multiple-choice Q&A (EgoSchema, MLVU, MVBench, NExT-QA, PerceptionTest, LongVideoBench, VideoMME).

| Model | VideoDC (test) | Dream-1K (test) | ActNet-QA (test) | VideoChatGPT (test) | EgoSchema (test) | MLVU (m-avg) | MVBench (test) | NExT-QA (mc) | PerceptionTest (val) | LongVideoBench (val) | VideoMME (wo/w subs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary models | | | | | | | | | | | |
| GPT-4V | 4.06 | 34.4 | 57.0 | 4.00 | - | 49.2 | 43.5 | - | - | 61.3 | 59.9/63.3 |
| GPT-4o | - | 39.2 | - | - | - | 64.6 | - | - | - | 66.7 | 71.9/77.2 |
| Gemini-1.5-Flash | - | 34.8 | 55.3 | - | 65.7 | - | - | - | - | 61.6 | 70.3/75.0 |
| Gemini-1.5-Pro | - | 36.2 | 57.5 | - | 72.2 | - | - | - | - | 64.0 | 75.0/81.3 |
| Open-source models | | | | | | | | | | | |
| VILA-40B | 3.37 | 33.2 | 58.0 | 3.36 | 58.0 | - | - | 67.9 | 54.0 | - | 60.1/61.1 |
| PLLaVA-34B | - | 28.2 | 60.9 | 3.48 | - | - | 58.1 | - | - | 53.2 | - |
| LongVA-7B | 3.14 | - | 50.0 | 3.20 | - | 56.3 | - | 68.3 | - | - | 52.6/54.3 |
| IXC-2.5-7B | - | - | 52.8 | 3.46 | - | 37.3 | 69.1 | 71.0 | 34.4 | - | 55.8/58.8 |
| LLaVA-OV-7B | 3.75 | 31.7 | 56.6 | 3.51 | 60.1 | 64.7 | 56.7 | 79.4* | 57.1 | 56.5 | 58.2/61.5 |
| VideoLLaMA2-72B | - | 27.1 | 55.2 | 3.16 | 63.9 | 61.2 | 62.0 | - | - | - | 61.4/63.1 |
| LLaVA-OV-72B | 3.60 | 33.2 | 62.3 | 3.62 | 62.0 | 68.0 | 59.4 | 80.2* | 66.9 | 61.3 | 66.2/69.5 |
| LLaVA-Video-7B | 3.66 | 32.5 | 56.5* | 3.52 | 57.3 | 70.8 | 58.6 | 83.2* | 67.9* | 58.2 | 63.3/69.7 |
| LLaVA-Video-72B | 3.73 | 34.0 | 63.4* | 3.62 | 65.6 | 74.4 | 64.1 | 85.4* | 74.3* | 61.9 | 70.5/76.9 |

Conclusion

This study introduces LLaVA-Video-178K, a high-quality synthetic dataset for video-language instruction following. It stands out for its dense frame sampling of longer, untrimmed videos and its coverage of diverse tasks, including captioning, open-ended QA, and multiple-choice QA. By training on LLaVA-Video-178K jointly with existing visual instruction tuning data, we developed a new model family, LLaVA-Video, which also adopts an efficient video representation to make better use of GPU resources, allowing more frames to be included during training. The experimental results demonstrate the effectiveness of the proposed synthetic dataset, and LLaVA-Video models achieve excellent performance on a wide range of video benchmarks.

Interactive Demos

We provide interactive demos to showcase the capabilities of LLaVA-Video for realistic multimodal interactions.

LLaVA-Video teaches me how to download "TikTok" on my iPhone, step by step.

LLaVA-Video helps me find the healthy drink in the living room, and describe the living room.

Related Blogs

  • LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
  • LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
  • LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild
  • LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?
  • LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models
  • LLaVA-OneVision: Easy Visual Task Transfer
  • LLaVA-Critic: Learning to Evaluate Multimodal Models
  • Accelerating the Development of Large Multimodal Models with LMMs-Eval

Citation

@misc{zhang2024videoinstructiontuningsynthetic,
      title={Video Instruction Tuning With Synthetic Data},
      author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
      year={2024},
      eprint={2410.02713},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.02713},
}
