T2VHE

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

Tianle Zhang1,2 , Langtian Ma3, Yuchen Yan4, Yuchen Zhang2, Kai Wang2, Yue Yang1, Ziyao Guo1, Wenqi Shao1, Yang You2, Yu Qiao1, Ping Luo5, Kaipeng Zhang1

1Shanghai AI Laboratory, 2National University of Singapore
3University of Wisconsin-Madison, 4University of California San Diego, 5The University of Hong Kong

†Corresponding Author: zhangkaipeng@pjlab.org.cn

Overview

overview

Overview of T2VHE. T2VHE comprises four key components: evaluation metrics, evaluation method, evaluators, and a dynamic evaluation module. To ensure a comprehensive assessment of T2V models, we meticulously devise a set of evaluation metrics, accompanied by precise definitions and corresponding reference perspectives. For ease of annotation, we employ a comparison-based scoring format as the evaluation method and develop annotator training so that researchers can procure high-quality annotations using post-training LRAs. Furthermore, our protocol incorporates an optional dynamic evaluation component, enabling researchers to obtain reliable evaluation results at reduced costs.

🔔News

🔥We release the code and data on GitHub.

Introduction

Text-to-video (T2V) technology has made significant advances in the last two years and garnered increasing attention from the general community. T2V products such as Gen2 and Pika have attracted many users. More recently, Sora, a powerful T2V model from OpenAI, further heightened public anticipation for T2V technology. Predictably, the evaluation of T2V generation will also become increasingly important, as it can guide the development of T2V models and assist the public in selecting appropriate ones. This paper conducts a comprehensive survey and explores a human evaluation protocol for T2V generation.

demo
Example of the annotation interface (left), wherein annotators choose the superior video based on the provided evaluation metrics. Detailed explanation and reference perspectives of the “Video Quality” metric, together with the corresponding examples and analysis process (right); annotators can make more accurate judgments by reading these guidelines and examples.

T2VHE

Evaluation metrics

| Metric | Definition | Reference perspective | Description | Type |
| --- | --- | --- | --- | --- |
| Video Quality | Which video is more realistic and aesthetically pleasing? | Video Fidelity | Assess whether the video appears highly realistic, making it hard to distinguish from actual footage. | Objective |
|  |  | Aesthetic Appeal | Evaluate the artistic beauty and aesthetic value of each video frame, including color coordination, composition, and lighting effects. |  |
| Temporal Quality | Which video has better consistency and less flickering over time? | Content Consistency | Evaluate whether the subject's and background's appearances remain unchanged throughout the video. | Objective |
|  |  | Temporal Flickering | Assess the consistency of local and high-frequency details over time in the video. |  |
| Motion Quality | Which video contains motions that are more natural, smooth, and consistent with physical laws? | Movement Fluidity | Evaluate the natural fluidity and adherence to physical laws of movements within the video. | Objective |
|  |  | Motion Intensity | Assess whether the dynamic activities in the video are sufficient and appropriate. |  |
| Text Alignment | Which video has a higher degree of alignment with the prompt? | Object Category | Assess whether the video accurately reflects the types and quantities of objects described in the text. | Objective |
|  |  | Style Consistency | Evaluate whether the visual style of the video matches the text description. |  |
| Ethical Robustness | Which video demonstrates higher ethical standards and fairness? | Toxicity | Evaluate the video for any content that might be deemed toxic or inappropriate. | Subjective |
|  |  | Fairness | Determine the fairness in the portrayal and treatment of characters or subjects across different social dimensions. |  |
|  |  | Bias | Assess the presence and handling of biased content within the video. |  |
| Human Preference | As an annotator, which video do you prefer? | Video Originality | Evaluate the originality of the video's contents. | Subjective |
|  |  | Overall Impact | Assess the emotional and intellectual value provided by the video. |  |
|  |  | Personal Preference | Assess the video based on the previous five metrics and personal preferences. |  |

Comprehensive evaluation metrics for T2V models. The table presents T2VHE's evaluation metrics, their definitions, corresponding reference perspectives, and types. Depending on the metric, annotators rely on the reference perspectives to different degrees when making their judgments.

Evaluation method

Since absolute scoring could still result in noisy annotations and pose challenges in reaching consensus among annotators, we use the less demanding comparative scoring method. Traditional comparative scoring protocols rely on the win ratio in pairwise comparisons; however, this approach has several drawbacks: it introduces bias if models are not uniformly compared and does not reliably indicate the likelihood of one model outperforming another. To overcome these issues, we adopt the Rao and Kupper model, a probabilistic approach that handles the results of pairwise comparisons more efficiently, requiring less data than full comparisons. The estimation is conducted by maximizing the log-likelihood function: \[ l(p, \theta) = \sum_{i = 1}^{t} \sum_{j = i + 1}^{t} \left( n_{ij} \log \frac{p_i}{p_i + \theta p_j} + n_{ji} \log \frac{p_j}{\theta p_i + p_j} + \tilde{n}_{ij} \log \frac{p_i p_j (\theta^2 - 1)}{(p_i + \theta p_j)(\theta p_i + p_j)} \right), \] where $t$ is the number of models, $p = (p_1, \cdots, p_t)^T \in \mathbb{R}^t$ is the vector of model scores, $\theta$ is a tolerance parameter, $n_{ij}$ denotes the number of times model $i$ is preferred to model $j$, and $\tilde{n}_{ij}$ denotes the number of times the two models tie.
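
As a concrete illustration, the sketch below (an illustrative reimplementation under our own parameterization, not the released code) estimates the model scores $p$ and the tolerance parameter $\theta$ by numerically maximizing this log-likelihood with scipy; parameterizing $p_i = e^{\beta_i}$ with $\beta_1 = 0$ and $\theta = 1 + e^{\gamma}$ is an assumption made for identifiability and to enforce $\theta > 1$.

```python
# Illustrative sketch: Rao–Kupper MLE from pairwise comparison counts.
import numpy as np
from scipy.optimize import minimize

def rao_kupper_scores(wins, ties):
    """wins[i, j]: times model i beat model j; ties[i, j]: times i and j tied (symmetric)."""
    t = wins.shape[0]

    def neg_log_lik(params):
        # p_i = exp(beta_i) with beta_1 fixed at 0 (identifiability); theta = 1 + exp(gamma) > 1.
        beta = np.concatenate(([0.0], params[:t - 1]))
        theta = 1.0 + np.exp(params[-1])
        p = np.exp(beta)
        ll = 0.0
        for i in range(t):
            for j in range(i + 1, t):
                d_ij = p[i] + theta * p[j]   # denominator of P(i beats j)
                d_ji = theta * p[i] + p[j]   # denominator of P(j beats i)
                ll += wins[i, j] * np.log(p[i] / d_ij)
                ll += wins[j, i] * np.log(p[j] / d_ji)
                ll += ties[i, j] * np.log(p[i] * p[j] * (theta ** 2 - 1) / (d_ij * d_ji))
        return -ll

    res = minimize(neg_log_lik, x0=np.zeros(t), method="L-BFGS-B")
    p = np.exp(np.concatenate(([0.0], res.x[:t - 1])))
    return p / p.sum(), 1.0 + np.exp(res.x[-1])   # relative scores (scale is arbitrary), theta
```

Only the relative values of the returned scores are meaningful; per-dimension rankings follow from ordering models by these scores.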

Evaluators

To ensure the quality of the annotations, we design a set of annotator training materials for each metric, including its detailed definition and corresponding reference perspectives; for each reference perspective, we further provide an analysis process and examples to help annotators make decisions. In addition, for different types of metrics we include different notes that help annotators balance objective criteria and subjective judgment. Below are training examples for the objective metric "Video Quality" and the subjective metric "Human Preference".

Example of training for "Video Quality"

Core Question: Which video is more realistic and aesthetically pleasing?

Note: Please select "Equal" only when both videos perform identically across all reference perspectives; if the reference perspectives conflict, please prioritize them in the order listed.
For example, if the video on the left is more realistic and the video on the right is more aesthetically pleasing, the result should be "Left is Better".


Reference perspectives:

P1. Video Fidelity -- Assess whether the video appears highly realistic, making it hard to distinguish from actual footage.

- Example prompt: bat eating fruits while hanging
- Analysis: In the left video, the bats and fruits merge together, and in some frames three wings appear; such scenes are almost never seen in reality. By contrast, the scenes in the right video are comparatively more reasonable.
Conclusion: Right is better.

P2. Aesthetic Appeal -- Evaluate the artistic beauty and aesthetic value of each video frame, including color coordination, composition, and lighting effects.

- Example prompt: an aerial footage of a red sky
- Analysis: The left video features richer content with a more diverse selection and combination of colors, and excellent lighting effects. In contrast, the right video is relatively more monotonous, and its color coordination is less appealing.
Conclusion: Left is better.

Example of training for "Human Preference"

Core Question: As an annotator, which video do you prefer? If the reference perspectives conflict, please prioritize them according to your own preferences.

Note: Due to the subjective nature of this criterion, this guide only offers possible perspectives for reference.

Reference perspectives:

P1: Video Originality -- Evaluate the originality of the video's contents.

- Example 1: Does the video add its own innovative features beyond the corresponding real-world items?
- Example 2: Can the video be distinguished from traditional videos in terms of style or narrative technique?

P2: Overall Impact -- Assess the emotional and intellectual value provided by the video.

- Example 1: Does the video evoke a strong emotional response, such as joy, sadness, or excitement?
- Example 2: Does the video stimulate intellectual curiosity or provide thought-provoking content?

P3: Personal Preference -- Assess the video based on the previous five metrics and personal preferences.

- Example 1: Does the artistic style of the video appeal to your personal taste?
- Example 2: Do the themes or the moral messages of the video resonate with your personal values or experiences?

Dynamic evaluation module

Evaluating multiple video models via traditional pairwise evaluation protocols becomes increasingly resource-intensive as the number of models grows. To efficiently obtain stable model rankings, we propose a pluggable dynamic evaluation module; the specific process is as follows:

Key Principles

  • Video Quality Proximity Rule
    • Leverages automated metric scores so that human annotators prioritize the annotation of samples that are difficult to distinguish automatically.
    • Ensures that initially evaluated video pairs have similar quality levels.
  • Model Strength Rule
    • Determines the evaluation priority of subsequent video pairs based on model strength scores.
    • Reduces the number of comparisons between models with large differences in strength, improving algorithmic efficiency.

Evaluation Process

  1. Initial Model Strength Assignment
    • Each model is assigned an initial neutral strength value, indicating no prior bias.
  2. Automatic Metrics Computation
    • For each video, the following metrics are evaluated by a pre-trained scorer: subject consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic quality, imaging quality, and overall consistency.
    • Scores are normalized and summed to produce the feature score for each video.
  3. Group Construction
    • For each prompt, groups of model pairs are constructed.
    • The absolute value of the difference between the scores of two videos in a pair is input into an exponential decay model.
    • The output value is the score of each video pair, and the sum of these scores forms the total score for the group.
  4. Sorting and Grouping
    • Groups are ranked by their total scores in descending order; higher scores, placed at the top, indicate less variation within the group.
    • This preprocessing step does not increase the cost of our evaluation protocol.
  5. Human Evaluation Phase
    • An initial set of video pairs is evaluated by humans, and model strengths are updated using the Rao and Kupper model.
    • Comparisons are then split into batches. For each video pair, the absolute difference in scores between the two models is entered into an exponential decay model, whose output value is the probability that the pair will be discarded (see the sketch after this list). After each batch, the model strength estimates are updated under the six evaluation dimensions.
    • The evaluation ends when the model rankings stabilize, meaning the rankings of models under each dimension remain unchanged for several consecutive batches.
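
As a rough illustration of steps 2–5, the sketch below shows how the feature scores, group scores, and batch-phase discarding could be computed. The decay rate `lam`, the min-max normalization, and the exact mapping from strength gap to keep probability are our own assumptions, not the released implementation.

```python
# Illustrative sketch of the dynamic evaluation module's pair prioritization.
import math
import random

def feature_score(metric_scores, metric_ranges):
    """Normalize each automatic metric score to [0, 1] (assumed min-max scheme) and sum them."""
    total = 0.0
    for name, value in metric_scores.items():
        lo, hi = metric_ranges[name]
        total += (value - lo) / (hi - lo) if hi > lo else 0.0
    return total

def group_score(pairs, lam=1.0):
    """Total score of one prompt's group of video pairs: pairs whose feature scores are close
    contribute more, so hard-to-distinguish groups are annotated first."""
    return sum(math.exp(-lam * abs(a - b)) for a, b in pairs)

def keep_pair(strength_a, strength_b, lam=1.0):
    """Batch-phase filter (one reading of the Model Strength Rule): the larger the current
    strength gap between two models, the more likely their comparison is skipped."""
    return random.random() < math.exp(-lam * abs(strength_a - strength_b))

# Groups are ranked by group_score in descending order, and within each human annotation
# batch, pairs for which keep_pair(...) returns False are discarded.
```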

Experiment Results

Evaluation results

| Model | Video Quality | Temporal Quality | Motion Quality | Text Alignment | Ethical Robustness | Human Preference |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-training LRAs |  |  |  |  |  |  |
| Gen2 | 3.33 (1) | 2.63 (1) | 2.03 (1) | 1.57 (1) | 1.36 (1) | 2.87 (1) |
| Pika | 1.11 (2) | 1.71 (2) | 1.37 (2) | 1.03 (3) | 1.08 (3) | 1.21 (2) |
| Latte | 0.67 (5) | 0.79 (5) | 0.84 (5) | 1.03 (4) | 1.00 (5) | 0.77 (5) |
| TF-T2V | 0.76 (3) | 1.09 (3) | 1.01 (4) | 0.90 (5) | 1.06 (4) | 0.87 (4) |
| Videocrafter2 | 0.72 (4) | 0.92 (4) | 1.06 (3) | 1.24 (2) | 1.12 (2) | 0.91 (3) |
| Post-training LRAs |  |  |  |  |  |  |
| Gen2 | 2.71 (1) | 2.37 (1) | 2.16 (1) | 2.71 (1) | 2.57 (1) | 2.96 (1) |
| Pika | 1.16 (2) | 1.34 (2) | 1.24 (2) | 1.12 (3) | 1.18 (3) | 1.24 (2) |
| Latte | 0.82 (5) | 0.89 (4) | 0.89 (4) | 1.43 (2) | 1.42 (2) | 0.89 (3) |
| TF-T2V | 0.91 (3) | 1.00 (3) | 0.95 (3) | 0.82 (4) | 0.86 (4) | 0.85 (4) |
| Videocrafter2 | 0.82 (4) | 0.83 (5) | 0.89 (5) | 0.68 (5) | 0.73 (5) | 0.76 (5) |
| AMT Annotators |  |  |  |  |  |  |
| Gen2 | 2.25 (1) | 2.29 (1) | 2.11 (1) | 2.76 (1) | 3.14 (1) | 2.73 (1) |
| Pika | 1.09 (2) | 1.21 (2) | 1.23 (2) | 1.00 (3) | 0.82 (3) | 1.04 (2) |
| Latte | 0.80 (5) | 0.88 (4) | 0.89 (4) | 1.40 (2) | 1.29 (2) | 0.87 (3) |
| TF-T2V | 0.90 (3) | 0.88 (3) | 0.91 (3) | 0.71 (4) | 0.49 (4) | 0.71 (4) |
| Videocrafter2 | 0.86 (4) | 0.76 (5) | 0.87 (5) | 0.51 (5) | 0.29 (5) | 0.56 (5) |
| Post-training LRAs (Dyn) |  |  |  |  |  |  |
| Gen2 | 2.75 (1) | 2.42 (1) | 2.30 (1) | 2.90 (1) | 2.66 (1) | 2.98 (1) |
| Pika | 1.22 (2) | 1.46 (2) | 1.35 (2) | 1.21 (3) | 1.23 (3) | 1.31 (2) |
| Latte | 0.86 (5) | 0.97 (4) | 0.92 (4) | 1.62 (2) | 1.53 (2) | 0.98 (3) |
| TF-T2V | 0.92 (3) | 1.01 (3) | 1.00 (3) | 0.86 (4) | 0.91 (4) | 0.89 (4) |
| Videocrafter2 | 0.87 (4) | 0.86 (5) | 0.88 (5) | 0.69 (5) | 0.76 (5) | 0.81 (5) |

Scores and rankings (in parentheses) of models across the six dimensions for pre-training LRAs, post-training LRAs, and AMT Annotators. Post-training LRAs (Dyn) refers to the annotation results of post-training LRAs using the dynamic evaluation component.

comparison

We evaluate five state-of-the-art T2V models. As illustrated in the Table above and the Figure below, closed-source models typically perform better, regardless of the annotator source.

Furthermore, we identify the following key observations:

1): The annotation results obtained by the pre-training LRAs differ markedly from those of the other three groups, as is evident from the discrepancies in the final model scores and rankings for each dimension. In addition, the annotation results of the post-training LRAs closely mirror those of the AMT personnel, yielding consistent final model rankings. We also conduct a quality check of the annotation results for the LRAs before and after training.

2): In the annotated results from AMT personnel, Gen2 demonstrates significant superiority over the other models across all metrics, while Pika also exhibits commendable performance across most metrics. In contrast, the performances of the open-source models show less disparity in the video quality, temporal quality, and motion quality metrics. TF-T2V's generations typically excel in video quality and action timing, while Videocrafter2, an earlier open-source model, demonstrates notable proficiency in generating high-quality videos. However, distinctions among the three open-source models become more apparent in the text alignment, ethical robustness, and human preference metrics. Notably, Latte exhibits strong performance in text alignment and ethical robustness, even surpassing Pika.

Module validation

Our protocol, augmented by the dynamic evaluation module, cuts annotation costs to about 53% of the original expense while achieving comparable outcomes. Subsequent experiments further confirm the module's effectiveness and reliability.

comparison

1): The dynamic component keeps the number of annotations needed for the evaluation growing nearly linearly as the number of models increases. As shown in the Figure above (left), this greatly reduces evaluation costs.

2): The dynamic evaluation algorithm guarantees that the most valuable samples are annotated. Before the dynamic evaluation starts, annotators are required to annotate 200 video pairs that are difficult to distinguish based on automated metrics. This step ensures that the samples most deserving of human assessment are not discarded during the dynamic evaluation and that the initially estimated model scores are not biased by specific prompt types.

3): The bootstrap confidence intervals for the score estimates further prove the validity of the annotation results. As shown in the Figure above (right), the confidence intervals for Latte, Pika, TF-T2V, and Videocrafter2 are consistently narrow, signifying precise estimates. At the same time, the confidence intervals for Gen2's scores are relatively wide, because our dynamic algorithm frequently excludes comparisons involving Gen2 due to its clear superiority over the other models. Nevertheless, even at the lower bound of its confidence intervals, Gen2's score estimate remains superior to those of all other models. This highlights that the rank estimates remain robust despite some instability in the score estimates.
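
Such intervals can be obtained with a standard percentile bootstrap over the annotated pairwise outcomes. The sketch below assumes the rao_kupper_scores helper sketched in the Evaluation method section, a per-comparison resampling unit, and a 95% interval; these are illustrative assumptions rather than the authors' exact procedure.

```python
# Illustrative percentile-bootstrap confidence intervals for model scores.
import numpy as np

def bootstrap_score_intervals(outcomes, n_models, n_boot=1000, alpha=0.05, seed=0):
    """outcomes: array of (model_a, model_b, result) rows, result in {0: a wins, 1: b wins, 2: tie}."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    boot_scores = np.empty((n_boot, n_models))
    for b in range(n_boot):
        # Resample annotated comparisons with replacement and re-count wins/ties.
        sample = outcomes[rng.integers(0, len(outcomes), size=len(outcomes))]
        wins = np.zeros((n_models, n_models))
        ties = np.zeros((n_models, n_models))
        for a, m, r in sample:
            if r == 0:
                wins[a, m] += 1
            elif r == 1:
                wins[m, a] += 1
            else:
                ties[a, m] += 1
                ties[m, a] += 1
        boot_scores[b], _ = rao_kupper_scores(wins, ties)  # helper sketched earlier
    lower = np.percentile(boot_scores, 100 * alpha / 2, axis=0)
    upper = np.percentile(boot_scores, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```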