Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions

Malte Prinzler1,2, Egor Zakharov3, Vanessa Sklyarova1,2, Berna Kabadayi1, Justus Thies1,4
1 Max Planck Institute for Intelligent Systems, Tübingen, Germany
2 Max Planck ETH Center for Learning Systems
3 ETH Zürich   4 Technical University of Darmstadt


Joker uses one reference image to generate a 3D reconstruction with a novel extreme expression. The target expression is defined through 3DMM parameters and text prompts. The text prompts effectively resolve ambiguities in the 3DMM input and can control emotion-related expression subtleties and tongue articulation.



Abstract

We introduce Joker, a new method for the conditional synthesis of 3D human heads with extreme expressions. Given a single reference image of a person, we synthesize a volumetric human head with the reference’s identity and a new expression. We offer control over the expression via a 3D morphable model (3DMM) and textual inputs. This multi-modal conditioning signal is essential since 3DMMs alone fail to define subtle emotional changes and extreme expressions, including those involving the mouth cavity and tongue articulation. Our method is built upon a 2D diffusion-based prior that generalizes well to out-of-domain samples, such as sculptures, heavy makeup, and paintings, while achieving high levels of expressiveness. To improve view consistency, we propose a new 3D distillation technique that converts predictions of our 2D prior into a neural radiance field (NeRF). Both the 2D prior and our distillation technique produce state-of-the-art results, as confirmed by our extensive evaluations. Moreover, to the best of our knowledge, our method is the first to achieve view-consistent extreme tongue articulation.




Method

Overview: Our method consists of two components: a 2D Diffusion-Based Prior and a Progressive 3D Distillation procedure. The 2D prior predicts images of the reference identity while expression and pose are controlled through a 3DMM and text prompts. During the Progressive 3D Distillation, predictions of the 2D prior are used to generate a 3D reconstruction.


2D Diffusion-Based Prior: Our 2D Diffusion-Based Prior consists of three components, all constructed as UNets initialized from pretrained Stable Diffusion 1.5 weights. The Denoising Network takes latent noise and a driving prompt as input and generates the output images in a standard latent-diffusion denoising process. The Reference Network takes the reference image as input and passes it through its UNet structure; the keys and values of its self-attention layers are appended to their respective counterparts in the Denoising Network (see DiffPortrait3D for more details). The Driving Network is constructed as a ControlNet that drives the Denoising Network with normal-map renderings of the 3DMM.
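To illustrate the reference-attention mechanism, here is a minimal PyTorch sketch of how keys and values from the Reference Network can be appended to a self-attention layer of the Denoising Network. All names (reference_self_attention, ref_feats, to_q/to_k/to_v) are hypothetical, and single-head attention is assumed for brevity; this is a sketch of the general technique, not the authors' implementation.

import torch
import torch.nn.functional as F

def reference_self_attention(x, ref_feats, to_q, to_k, to_v):
    # x:         (B, N, C) tokens of the Denoising Network at this layer
    # ref_feats: (B, M, C) tokens of the Reference Network at the same layer
    # to_q/to_k/to_v: the attention layer's linear projections (nn.Linear)
    q = to_q(x)
    # Keys and values computed from the reference features are appended to
    # the Denoising Network's own keys and values, so every output token
    # can also attend to the reference image.
    k = torch.cat([to_k(x), to_k(ref_feats)], dim=1)  # (B, N+M, C)
    v = torch.cat([to_v(x), to_v(ref_feats)], dim=1)  # (B, N+M, C)
    return F.scaled_dot_product_attention(q, k, v)    # (B, N, C)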

Progressive 3D Distillation: During our Progressive 3D Distillation, predictions of the 2D prior are used to optimize a 3D reconstruction in the form of a NeRF. The distillation proceeds in two stages that differ in how the target images are generated and how frequently they are updated.

Stage 1: The NeRF is rendered from several views. The renderings are noised according to discrete noise levels from a DDIM noise schedule and denoised in a single inference step using the 2D prior. The resulting images serve as targets for optimizing the NeRF over several iterations. This procedure is repeated while the amplitude of the applied noise decreases monotonically. Stage 1 is stopped after ~60% of the noise levels, after which Stage 2 takes over.
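The Stage 1 loop might look as follows in Python. The interfaces (nerf.render, prior.add_noise, prior.denoise_single_step) are assumptions chosen for illustration under the description above, not the authors' API.

import torch
import torch.nn.functional as F

def distill_stage1(nerf, prior, cameras, noise_levels, optimizer,
                   iters_per_level=100):
    # noise_levels: the first ~60% of the DDIM schedule, highest noise first.
    for t in noise_levels:
        # Generate fixed target images for this noise level.
        with torch.no_grad():
            targets = []
            for cam in cameras:
                rendering = nerf.render(cam)           # current NeRF view
                noisy = prior.add_noise(rendering, t)  # noise to level t
                targets.append(prior.denoise_single_step(noisy, t))
        # Optimize the NeRF against these targets for several iterations
        # before moving to the next (lower) noise level.
        for _ in range(iters_per_level):
            loss = sum(F.mse_loss(nerf.render(cam), target)
                       for cam, target in zip(cameras, targets))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()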

Stage 2: We initialize the NeRF from the distillation result of Stage 1. Again, the NeRF is rendered from several views and the renderings are noised. This time, the target images are predicted in a multi-step DDIM denoising process over the remaining noise levels of the predefined schedule. The NeRF is optimized against the resulting images until convergence. We found that this two-stage approach improves the view consistency and synthesis quality of the 3D reconstruction.
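Continuing the hypothetical interface from the Stage 1 sketch, Stage 2 could look like the following; prior.ddim_step and the remaining details are again assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def distill_stage2(nerf, prior, cameras, remaining_levels, optimizer,
                   num_iters=2000):
    # nerf is initialized from the Stage 1 result; remaining_levels are
    # the noise levels not consumed in Stage 1, highest noise first.
    t_start = remaining_levels[0]
    with torch.no_grad():
        targets = []
        for cam in cameras:
            noisy = prior.add_noise(nerf.render(cam), t_start)
            # Multi-step DDIM denoising over the remaining schedule.
            for t in remaining_levels:
                noisy = prior.ddim_step(noisy, t)
            targets.append(noisy)
    # Optimize the NeRF against the fixed targets until convergence.
    for _ in range(num_iters):
        loss = sum(F.mse_loss(nerf.render(cam), target)
                   for cam, target in zip(cameras, targets))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()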


Baseline Comparisons



More Results

Paper

BibTeX


@misc{prinzler2024joker,
      title={Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions},
      author={Malte Prinzler and Egor Zakharov and Vanessa Sklyarova and Berna Kabadayi and Justus Thies},
      year={2024},
      eprint={2410.16395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.16395},
}