FlowDubber Demo Page

ABSTRACT

Movie dubbing aims to convert scripts into speech that aligns with a given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose FlowDubber, a large language model (LLM) based flow matching architecture for dubbing. It achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning, and it achieves better acoustic quality than previous works via the proposed voice-enhanced flow matching. First, we introduce Qwen2.5 as the LLM backbone to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) ensures mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two ways: it introduces an LLM-based acoustics flow matching guidance to strengthen clarity, and it uses an affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available here.

MODEL ARCHITECTURE

Overall framework of FlowDubber. It consists of LLM-based Semantic-aware Learning (LLM-SL), lip-phoneme Dual Contrastive Aligning (DCA), and Flow-based Voice Enhancing (FVE). Specifically, the LLM-SL includes a Qwen2.5-0.5B speech language model and semantic-aware phoneme learning to preserve pronunciation while aligning with DCA. The FVE consists of style flow matching prediction and LLM-based acoustics flow matching guidance to improve acoustic quality.
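
The lip-phoneme alignment idea can be illustrated with a generic symmetric contrastive loss. This page does not give the actual DCA objective, so the following is only a minimal sketch under assumptions: the function name `dual_contrastive_loss`, the `temperature` value, and the one-to-one pairing of lip and phoneme embeddings by row are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def dual_contrastive_loss(lip, phoneme, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed stand-in for DCA).

    lip, phoneme: arrays of shape (N, D); row i of each is treated as
    a positive pair, and every other row as a negative.
    """
    # Cosine similarity between every lip and every phoneme embedding.
    lip = lip / np.linalg.norm(lip, axis=1, keepdims=True)
    phoneme = phoneme / np.linalg.norm(phoneme, axis=1, keepdims=True)
    sim = lip @ phoneme.T / temperature
    idx = np.arange(sim.shape[0])
    # Direction 1: each lip embedding must pick out its own phoneme.
    lip_to_phoneme = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Direction 2: each phoneme embedding must pick out its own lip frame.
    phoneme_to_lip = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (lip_to_phoneme + phoneme_to_lip)
```

Averaging the two directions is what makes the alignment mutual in this sketch: a phoneme that is close to several lip embeddings is penalized even when each lip embedding already ranks its own phoneme first.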

EXPERIMENTS

Current SOTA dubbing baselines (all experimental results use the official code or provided checkpoints):
1) V2C-Net (CVPR'22) is a vanilla dubbing model for V2C tasks, which aggregates the desired speaker and visual embeddings as input to the Mel-Decoder.
2) HPMDubbing (CVPR'23) is a hierarchical dubbing method that bridges fine-grained video representations and speech attributes at three levels.
3) StyleDubber (ACL'24) is a SOTA dubbing model using multi-scale style learning at the multi-modal phoneme level and the acoustic utterance level.
4) Speaker2Dubber (ACM MM'24) is a SOTA pre-trained dubbing method with a two-stage strategy that learns pronunciation from an additional TTS corpus.
5) ProDubber (CVPR'25) is a SOTA dubbing model that first learns acoustic modeling from a text-speech corpus and then adapts the prosody to the given videos.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting1 Sample1 (Full Video)
Text Content: So the reaction quotient is actually just a reaction product, the product of the two ions.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting1 Sample1 (See Lip-motion Details)
Text Content: So the reaction quotient is actually just a reaction product, the product of the two ions.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting1 Sample2 (Full Video)
Text Content: In fact, if we write the equilibrium expression for this, we'll find the equilibrium constant is less than one.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting1 Sample2 (See Lip-motion Details)
Text Content: In fact, if we write the equilibrium expression for this, we'll find the equilibrium constant is less than one.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
GRID Setting1 Sample1 (Full Video)
Text Content: Lay blue at q four now

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
GRID Setting1 Sample2 (Full Video)
Text Content: Set red by s three please

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting2 Sample1 (Full Video)
Text Content: So when you mix these two, the reaction favors the products and forms a lot of ions.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting2 Sample1 (See Lip-motion Details)
Text Content: So when you mix these two, the reaction favors the products and forms a lot of ions.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting2 Sample2 (Full Video)
Text Content: So here's a simple chemical reaction, A going to B.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
Chem Setting2 Sample2 (See Lip-motion Details)
Text Content: So here's a simple chemical reaction, A going to B.

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
GRID Setting2 Sample1 (Full Video)
Text Content: Bin green at d nine soon

Ground-Truth	V2C-Net	HPMDubbing	StyleDubber	Speaker2Dubber	ProDubber	FlowDubber(Ours)
GRID Setting2 Sample2 (Full Video)
Text Content: Place white with b four soon

LLM-based Acoustics Flow Matching Guidance using different scales

Text Content: Well, we have a new unit here, watts, and that's a unit of power-- how much energy is transferred per second, how many

Sample #1

a = 0.0
a = 0.8

Text Content: We're looking at the sublimation of iodine solid to form iodine gas.

Sample #2

a = 0.0
a = 0.8
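
To give a rough sense of what a guidance scale does during flow matching sampling, here is a toy sketch. The page does not state how the scale `a` enters the sampler, so the combination rule below (a classifier-free-guidance-style extrapolation between two vector fields), the function names `guided_euler_sample` and `toward`, and the toy targets are all assumptions for illustration, not the FlowDubber implementation.

```python
import numpy as np

def guided_euler_sample(v_cond, v_uncond, x0, scale, steps=32):
    """Integrate a guided vector field from t=0 to t=1 with Euler steps.

    v_cond, v_uncond: callables (x, t) -> velocity, assumed here to be a
    guided and an unguided predictor. scale plays the role of `a`.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        vc = v_cond(x, t)
        vu = v_uncond(x, t)
        # Assumed rule: scale = 0 follows v_cond alone, larger values
        # push further along the direction (v_cond - v_uncond).
        x = x + dt * (vc + scale * (vc - vu))
    return x

def toward(target):
    # Toy vector field whose straight-line flow ends at `target` at t=1.
    target = np.asarray(target, dtype=float)
    return lambda x, t: (target - x) / (1.0 - t)

noise = np.random.default_rng(0).normal(size=(4, 8))  # stand-in for mel-shaped noise
sharp = np.ones((4, 8))    # hypothetical "detailed" target
flat = np.zeros((4, 8))    # hypothetical "blurry" target
plain = guided_euler_sample(toward(sharp), toward(flat), noise, scale=0.0)
guided = guided_euler_sample(toward(sharp), toward(flat), noise, scale=0.8)
```

In this toy setup the unguided run lands on the `sharp` target, while the guided run overshoots it in the direction away from `flat`, which is the usual intuition for why a larger scale accentuates detail.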

More Audio-Visual Visualizations

Visualization of the mel-spectrograms of the ground truth (GT) and of the audio synthesized by different models. In (a), green arrows point to the video frames with no speech, and green bounding boxes highlight the pauses in the speech. In (b), pink arrows point to the details of the mel-spectrogram that are enhanced as the flow matching guidance scale increases.