Morphable Diffusion:
3D-Consistent Diffusion for Single-image Avatar Creation
CVPR 2024

TL;DR: We introduce a morphable diffusion model that enables consistent, controllable novel view synthesis of humans from a single image. Given a single input image and a morphable mesh with a desired facial expression, our method directly generates 3D-consistent, photo-realistic images from novel viewpoints, from which a coarse 3D model can be reconstructed with off-the-shelf neural surface reconstruction methods such as NeuS2.
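To make the intended workflow concrete, here is a minimal Python sketch of how such a model might be driven end to end. Every name below (the morphable_diffusion module, MorphableDiffusion, load_morphable_mesh, sample) is a hypothetical placeholder for exposition, not the released API.

import torch

# Hypothetical interface -- all names are placeholders, not the released API.
from morphable_diffusion import MorphableDiffusion, load_morphable_mesh

model = MorphableDiffusion.from_pretrained("facescape").eval().cuda()

image = torch.rand(1, 3, 256, 256).cuda()   # single input photo y
mesh = load_morphable_mesh("smile.obj")     # 3DMM mesh with the target expression

# Jointly denoise a set of target views so that they stay 3D-consistent.
with torch.no_grad():
    views = model.sample(image, mesh, num_views=16)   # (16, 3, 256, 256)

# The synthesized views can then be passed to an off-the-shelf surface
# reconstruction method such as NeuS2 to obtain a coarse 3D model.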

Pipeline

Figure: pipeline overview.

Our pipeline consists of four parts (a rough code sketch follows the list):
A) a noise feature volume is constructed by lifting the 2D noise features into a 3D volume $\mathbf{G}$, which is processed by a 3D CNN $\eta_\theta$ to produce a feature grid $\mathbf{F}_G$.
B) a morphable noise volume is constructed by attaching the 2D noise features to the mesh vertices, which are processed by a SparseConvNet $f_\theta$ to output a 3DMM-aware feature volume $\mathbf{F}_V$.
C) the resulting feature grids $\mathbf{F}_G$ and $\mathbf{F}_V$ are blended via addition, and the blended noise volume is interpolated into the frustum $\mathbf{F}^{(i)}$ of the target view $(i)$ that we wish to synthesize.
D) the noisy target image $\mathbf{x}_t^{(i)}$, the input image $\mathbf{y}$, and the target feature frustum are processed by a pre-trained 2D UNet, akin to SyncDreamer, to predict the denoised image $\mathbf{x}_{t-1}^{(i)}$ at the next iteration.
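The following PyTorch-style sketch illustrates the data flow of stages A-D under assumed shapes and stand-in modules; it is an illustration, not the paper's exact implementation (in particular, dense 3D convolutions stand in for $\eta_\theta$ and the SparseConvNet $f_\theta$).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative channel count and volume resolution -- assumptions, not the paper's values.
C, R = 16, 32

eta = nn.Conv3d(C, C, 3, padding=1)        # stand-in for the 3D CNN eta_theta
f_theta = nn.Conv3d(C, C, 3, padding=1)    # dense stand-in for the SparseConvNet f_theta

# (A) Lift 2D noise features into a dense 3D volume G and process it into F_G.
G = torch.randn(1, C, R, R, R)             # unprojected 2D noise features
F_G = eta(G)

# (B) Attach 2D noise features to the 3DMM mesh vertices, voxelize them, and
#     process the (here densified) volume into the 3DMM-aware features F_V.
V = torch.zeros(1, C, R, R, R)
vert_idx = torch.randint(0, R, (500, 3))   # voxel coordinates of mesh vertices
V[0, :, vert_idx[:, 0], vert_idx[:, 1], vert_idx[:, 2]] = torch.randn(C, 500)
F_V = f_theta(V)

# (C) Blend the two feature grids by addition, then trilinearly interpolate the
#     blended volume at sample points inside the target view's frustum.
F_blend = F_G + F_V
pts = torch.rand(1, 64, 64, 16, 3) * 2 - 1                 # frustum samples in [-1, 1]^3
F_i = F.grid_sample(F_blend, pts, align_corners=True)      # frustum features F^(i)

# (D) A pre-trained 2D UNet (as in SyncDreamer) would then take the noisy target
#     image x_t^(i), the input image y, and F_i to predict x_{t-1}^(i).
print(F_i.shape)  # torch.Size([1, 16, 64, 64, 16])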

Novel View Synthesis of Faces

We evaluate our method for single-view reconstruction of human faces on the FaceScape dataset and compare it against EG3D, pixelNeRF, SSDNeRF, and SyncDreamer. For EG3D, we use a custom GAN inversion repository to search for the latent code that best matches the input identity. Our method achieves the best scores on the perceptual metrics while preserving accurate facial expressions and good resemblance, which we attribute to the effective conditioning on the morphable model.

Novel Facial Expression Synthesis

We test our method for novel facial expression synthesis from a single image and compare it with MoFaNeRF. Our method produces better resemblance to the input subject.

We also finetune our method on 16 views of the test subject in a neutral expression and compare the finetuned model with DiffusionRig. Finetuning slightly improves resemblance over the single-input results and yields images with better fidelity than the baseline.

Novel View Synthesis of Full Bodies

We also test our method for single-view reconstruction of full-body humans on the THuman2.0 dataset and compare it with pixelNeRF, Zero-1-to-3, and SyncDreamer. Given a single full-body image in an arbitrary pose, our method produces novel views of the person with the most accurate body poses.

Figure: full-body novel view synthesis results.

Full-body Animation

We also show that our method can be applied to animate a full-body image using a sequence of SMPL meshes as the driving signal; a rough sketch of such a loop follows.
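The per-frame driving loop below reuses the same hypothetical placeholder API as the earlier sketch (MorphableDiffusion, load_morphable_mesh, sample); the checkpoint name and mesh paths are likewise assumptions.

import torch

# Hypothetical driving loop -- reuses the placeholder API from the earlier sketch.
from morphable_diffusion import MorphableDiffusion, load_morphable_mesh

model = MorphableDiffusion.from_pretrained("thuman2").eval().cuda()
image = torch.rand(1, 3, 256, 256).cuda()   # single full-body photo of the subject

frames = []
for t in range(30):
    # Each output frame is conditioned on a posed SMPL mesh, which plays the
    # role of the morphable model and acts as the driving signal.
    smpl_mesh = load_morphable_mesh(f"pose_{t:03d}.obj")
    with torch.no_grad():
        frames.append(model.sample(image, smpl_mesh, num_views=1))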

Citation

@inproceedings{chen2024morphable,
  title     = {Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation},
  author    = {Xiyi Chen and Marko Mihajlovic and Shaofei Wang and Sergey Prokudin and Siyu Tang},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}
                

Acknowledgements

We thank Korrawe Karunratanakul for the discussion on per-subject finetuning of diffusion models, Timo Bolkart for the advice on fitting the FLAME model, and Malte Prinzler for the help with the color-calibrated FaceScape data. This project is partially supported by the SNSF grant 200021 204840. Shaofei Wang also acknowledges support from the ERC Starting Grant LEGO-3D (850533) and the DFG EXC number 2064/1 - project number 390727645.
The website template was borrowed from Michaël Gharbi.