Figure 1: Architecture of the proposed EmoDubber, which consists of four main components: Lip-related Prosody Aligning (LPA) learns the inherent consistency between lip motion and phoneme prosody via duration-level contrastive learning; Pronunciation Enhancing (PE) fuses the output of LPA with the expanded phoneme sequence via an efficient Conformer; Speaker Identity Adapting (SIA) generates the acoustic prior information μ while injecting speaker style; and Flow-based User Emotion Controlling (FUEC) renders the user-specified emotion and intensity E during the flow-matching prediction process using positive and negative guidance.
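To make the FUEC guidance step concrete, below is a minimal sketch of how positive and negative guidance might combine in one flow-matching prediction step. It assumes a classifier-free-guidance-style combination rule, and all names here (`v_theta`, `e_pos`, `e_neg`, the `emotion` keyword) are hypothetical; the paper defines the exact mechanism, so treat this only as an illustration of the pattern.

```python
def fuec_guided_velocity(v_theta, x_t, t, cond, e_pos, e_neg, alpha, beta):
    """Sketch of a FUEC-style guided velocity (assumed CFG-like rule, not the paper's exact formula).

    v_theta : flow-matching network predicting a velocity field
    x_t     : sample at flow time t
    cond    : lip/phoneme/speaker conditioning (outputs of LPA, PE, SIA)
    e_pos   : embedding of the user-specified target emotion E
    e_neg   : embedding of the emotion(s) to suppress
    alpha   : positive guidance scale (pulls toward e_pos)
    beta    : negative guidance scale (pushes away from e_neg)
    """
    v_uncond = v_theta(x_t, t, cond, emotion=None)   # emotion-unconditional velocity
    v_pos = v_theta(x_t, t, cond, emotion=e_pos)     # conditioned on the target emotion
    v_neg = v_theta(x_t, t, cond, emotion=e_neg)     # conditioned on the suppressed emotion
    # CFG-style combination: amplify the target direction, subtract the unwanted one.
    return v_uncond + alpha * (v_pos - v_uncond) - beta * (v_neg - v_uncond)

# One Euler step of the flow-matching ODE would then be:
#   x_next = x_t + dt * fuec_guided_velocity(...)
```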
Sample Chem (Full Video)

Text Content: so when you mix these two, the reaction favors the products and forms a lot of ions.

| Ground-Truth | FastSpeech2 | Zero-shot TTS | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample Chem (Full Video)

Text Content: well let's just calculate from the ideal gas law.

| Ground-Truth | FastSpeech2 | Zero-shot TTS | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample Chem (Full Video)

Text Content: i can do that from the number of moles, the temperature, and the size of the flask.

| Ground-Truth | FastSpeech2 | Zero-shot TTS | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample Chem (Full Video)

Text Content: so already we have a prediction from our quantum mechanical understanding of bonding.

| Ground-Truth | FastSpeech2 | Zero-shot TTS | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample GRID (Full Video)

Text Content: lay blue with d seven now

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample GRID (Full Video)

Text Content: set white by i six please

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample GRID (Full Video)

Text Content: place red at g eight soon

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample GRID (Full Video)

Text Content: lay blue by t eight now

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample V2C-Animation (Full Video)

Text Content: now what?

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample V2C-Animation (Full Video)

Text Content: mom! dad!

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Sample V2C-Animation (Full Video)

Text Content: I win! Ha!

| Ground-Truth | Zero-shot TTS | V2C-Net | HPMDubbing | Speaker2Dub | StyleDubber | EmoDubber |
|---|---|---|---|---|---|---|
| (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Figure 2: Intensity performance of EmoDubber on Chem. The horizontal axis shows the positive guidance α, and the vertical axis shows the Intensity Score (IS); the different curves correspond to different values of the negative guidance β. A higher IS indicates stronger emotional intensity in the audio.

Figure 3: Intensity performance of EmoDubber on GRID. The horizontal axis shows the positive guidance α, and the vertical axis shows the Intensity Score (IS); the different curves correspond to different values of the negative guidance β. A higher IS indicates stronger emotional intensity in the audio.

Figure 4: Intensity performance of EmoDubber on V2C-Animation. The horizontal axis shows the positive guidance α, and the vertical axis shows the Intensity Score (IS); the different curves correspond to different values of the negative guidance β. A higher IS indicates stronger emotional intensity in the audio.
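The α and β sweeps in Figures 2–4 are easiest to read against the guidance rule sketched after Figure 1. Assuming a standard classifier-free-guidance pattern (an assumption for illustration, not the paper's stated formula), the guided velocity field would take the form

$$\tilde{v}_\theta(x_t, t) = v_\theta(x_t, t \mid \varnothing) + \alpha\bigl(v_\theta(x_t, t \mid e_{\mathrm{pos}}) - v_\theta(x_t, t \mid \varnothing)\bigr) - \beta\bigl(v_\theta(x_t, t \mid e_{\mathrm{neg}}) - v_\theta(x_t, t \mid \varnothing)\bigr),$$

under which increasing α pushes sampling toward the target emotion (raising IS), while β additionally suppresses competing emotions.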
Figure 5: The proposed EmoDubber achieves high-quality audio-visual alignment and clear pronunciation (top part of the figure), recreates audio with the desired emotion based on the user's instructions (middle part), and controls the emotional intensity, allowing users to make fine-grained edits to sentiment (bottom part).
Figure 6: The ``green check marks'' indicate correct audio-visual synchronization consistent with the ground-truth mel-spectrogram, while the ``gray crosses'' mark failure cases that misalign with the ground truth. The highlighted regions show that our model outperforms other SOTA dubbing methods in maintaining audio-visual alignment and is notably closer to the ground-truth dubbing. For instance, our method captures the natural pauses in the speaker's speech (see Chem benchmark VzvinAckmQU-022 and 7W0cz0oGHGE-005) and infers the correct start and end points of speech on the GRID benchmark (s24-bbbj9p and s26-sgbg7p), both of which are essential for maintaining audio-visual alignment. In contrast, other advanced dubbing methods fail to achieve this.