MATCH
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
(CVPR 2026)

1ETH Zürich 2Google
*Work done while Malte Prinzler interned at Google.
TL;DR: Given calibrated multi-view images, MATCH infers static Gaussian splat textures in 0.5 seconds. The resulting Gaussians are in dense semantic correspondence across subjects and expressions. This allows for various applications such as editing, expression transfer, and fast avatar optimization.

Abstract

We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh.

To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline.

Reconstruction Examples

Video

Method Overview

MATCH is a fast, feed-forward method designed to reconstruct photorealistic 3D human heads from calibrated multi-view images. In under a second, the network processes multi-view inputs and outputs a UV texture of 2D Gaussian splat parameters. These parameters are in dense semantic correspondence such that each Gaussian consistently represents the same semantic region (e.g., the nose tip), regardless of identity or expression.
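To make the output format concrete, here is a minimal numpy sketch of how a predicted UV texture could be split into per-texel 2D Gaussian splat parameters. The channel layout and count are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def split_gaussian_texture(tex: np.ndarray) -> dict:
    """Split a (H, W, 13) UV texture into 2D Gaussian splat parameters.

    Assumed channel layout: 3 color, 1 opacity, 3 location, 2 scale,
    4 rotation (quaternion). Because the UV layout is fixed, texel (u, v)
    refers to the same semantic region (e.g. the nose tip) for every
    subject and expression.
    """
    assert tex.shape[-1] == 13
    h, w, _ = tex.shape
    flat = tex.reshape(h * w, 13)
    return {
        "color":    flat[:, 0:3],
        "opacity":  flat[:, 3:4],
        "location": flat[:, 4:7],
        "scale":    flat[:, 7:9],   # 2D splats: two tangent-plane scales
        "rotation": flat[:, 9:13],  # quaternion
    }

# One Gaussian per texel of the predicted texture.
params = split_gaussian_texture(np.zeros((256, 256, 13), dtype=np.float32))
```

Because every frame and subject shares this texel-to-semantics mapping, editing or transferring a region is just an operation on the same texel indices across textures.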

MATCH Pipeline

The model leverages a transformer architecture that efficiently fuses 2D image data with 3D spatial awareness.

  • Image Tokenization: High-resolution input images are divided into manageable patches. These are fused with semantic features (extracted via Sapiens) and spatial ray coordinates to create dense Image Tokens.
  • Coarse Mesh Registration: The model uses a pretrained network (an adapted TEMPEH model) to estimate an initial, coarse 3D mesh of the subject's head directly from the input images.
  • UV Tokenization: Using the coarse mesh, we compute 3D location and color texture maps, which are divided into non-overlapping patches and projected to UV tokens.

    Image and UV tokens are processed by a sequence of two alternating attention blocks.

  • Registration-Guided Attention: To keep computation efficient—even as the number of input images increases—the transformer restricts attention based on physical visibility. Each UV token only attends to the specific image tokens that display its corresponding facial region, determined by a calculated correspondence score.
  • Grouped Attention: This block processes UV tokens and individual camera views separately. This helps seamlessly propagate information to unobserved or occluded regions of the head and perform image-space feature processing.
  • Gaussian Splat Decoding: The fully processed UV tokens are projected into a final output texture. Every texel in this map provides the precise parameters—color, opacity, location, scale, and rotation—needed to render the final 3D Gaussian splat representation.
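The registration-guided attention step above can be sketched as masked cross-attention, where each UV token only attends to the image tokens with the highest correspondence score. The top-k selection, shapes, and numpy implementation here are simplifications for illustration; the paper's actual mechanism for applying the correspondence score may differ.

```python
import numpy as np

def registration_guided_attention(uv_q, img_k, img_v, corr_score, top_k=8):
    """Sketch of registration-guided cross-attention (simplified, numpy).

    uv_q:       (N_uv, d)   query features, one per UV-map token
    img_k/v:    (N_img, d)  key/value features of all image tokens
    corr_score: (N_uv, N_img) correspondence of each UV token to each image
                token (e.g. derived by projecting the coarse mesh into views)

    Each UV token attends only to the top_k image tokens with the highest
    correspondence score; all others are masked out. This keeps the cost
    per UV token bounded as the number of input views grows.
    """
    d = uv_q.shape[-1]
    logits = uv_q @ img_k.T / np.sqrt(d)                   # (N_uv, N_img)
    # Mask out image tokens outside each row's top-k correspondences.
    kth = np.sort(corr_score, axis=1)[:, -top_k][:, None]  # k-th largest score
    logits = np.where(corr_score >= kth, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ img_v                                 # (N_uv, d)
```

With dense cross-view attention the cost per UV token grows linearly in the number of image tokens; the mask keeps it fixed at `top_k` regardless of how many views are added.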

Animatable Avatar Creation

Beyond static 3D reconstruction, MATCH accelerates the creation of lightweight, animatable GEM avatars by a factor of 10.

  • Bypassing Expensive Tracking: By inherently predicting Gaussians in dense semantic correspondence, the method skips the computationally heavy mesh-tracking and CNN-optimization steps typically required by frameworks like GEM.
  • Rapid Distillation: The model rapidly infers sequences of Gaussian textures from multi-view video. These textures are unposed into a canonical space and compressed using Principal Component Analysis (PCA).
  • Controllable Output: We optimize a regressor network that estimates the PCA coefficients from monocular videos, yielding high-quality, animatable avatars.
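The distillation step can be illustrated with a small PCA sketch over a sequence of canonical Gaussian textures. Function name, shapes, and the SVD-based implementation are assumptions for illustration, not the paper's code.

```python
import numpy as np

def compress_textures(textures: np.ndarray, n_components: int = 32):
    """Sketch: PCA-compress a sequence of canonical Gaussian textures.

    textures: (T, H, W, C) per-frame Gaussian textures, assumed to be
              already unposed into a canonical space.
    Returns the mean texture, the PCA basis, and per-frame coefficients,
    so that frame t is approximated by mean + coeffs[t] @ basis.
    """
    t, h, w, c = textures.shape
    flat = textures.reshape(t, -1)                  # (T, H*W*C)
    mean = flat.mean(axis=0, keepdims=True)
    centered = flat - mean
    # SVD of the centered frames yields the principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                       # (K, H*W*C)
    coeffs = centered @ basis.T                     # (T, K)
    return mean.reshape(h, w, c), basis, coeffs
```

A regressor that maps monocular video frames to the low-dimensional `coeffs` then suffices to drive the avatar, since each coefficient vector expands back into a full Gaussian texture.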

Baseline Comparisons

Static 3D Reconstruction

Animatable Head Avatars

BibTeX

@article{prinzler2026match,
  title={Feed-forward Gaussian Registration for Head Avatar Creation and Editing},
  author={Prinzler, Malte and Gotardo, Paulo and Tang, Siyu and Bolkart, Timo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}