We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, HART outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. These human-aligned meshes then initialize Gaussian splats to further enable sparse-view rendering. Although trained on only 2.3K synthetic scans, HART achieves state-of-the-art results across a wide range of datasets: Chamfer Distance improves by 18–23% for clothed-mesh reconstruction, PA-V2V drops by 6–27% for SMPL-X estimation, and LPIPS decreases by 15–27% for novel-view synthesis. These results suggest that feed-forward transformers can serve as scalable models for robust human reconstruction in real-world settings. Code and models will be released.
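To make the splat-initialization step concrete, the sketch below seeds 3D Gaussian parameters (means, scales, rotations, opacities, colors) from a reconstructed mesh surface. It is a minimal sketch, assuming a trimesh mesh is available; the function name init_gaussians_from_mesh and the nearest-neighbor scale heuristic are our own illustrative choices, not HART's actual initialization.

# Minimal sketch: seed 3D Gaussian splats from a reconstructed mesh surface.
# `init_gaussians_from_mesh` and its heuristics are illustrative only.
import numpy as np
import trimesh
from scipy.spatial import cKDTree

def init_gaussians_from_mesh(mesh: trimesh.Trimesh, num_points: int = 100_000):
    # Sample points uniformly on the mesh surface; keep the source face index
    # so the face normal can serve as an orientation hint for each splat.
    points, face_idx = trimesh.sample.sample_surface(mesh, num_points)
    normals = mesh.face_normals[face_idx]

    # Heuristic isotropic scale: distance to the nearest sampled neighbor.
    dists, _ = cKDTree(points).query(points, k=2)   # k=1 is the point itself
    scales = np.repeat(dists[:, 1:2], 3, axis=1)

    # Identity rotations (as wxyz quaternions) and full opacity to start.
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (num_points, 1))
    opacities = np.ones((num_points, 1))

    # Colors from face colors if the mesh carries them, otherwise mid-gray.
    if mesh.visual.kind == "face":
        colors = mesh.visual.face_colors[face_idx, :3] / 255.0
    else:
        colors = np.full((num_points, 3), 0.5)

    return {"means": points, "scales": scales, "rotations": rotations,
            "opacities": opacities, "colors": colors, "normals": normals}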
Overview of our Network Architecture. Given $N$ uncalibrated human images, our HART transformer first maps the input images $\{ I_i \}_{i=1}^N$ to per-pixel point maps $\hat{p}_i$, refined normal maps $\hat{\mathbf{n}}_i$, SMPL-X tightness vectors $\hat{\mathbf{v}}_i$, and body part labels $\hat{l}_i$. The oriented point maps $\hat{p}_i, \hat{\mathbf{n}}_i$ from all views are merged and converted into an indicator grid via Differentiable Poisson Surface Reconstruction (DPSR). A 3D-UNet $g_{\theta}$ refines this grid into $\chi_{\mathrm{refined}}$ to account for self-occlusions, and the clothed mesh $\mathbf{M}_{\mathrm{clothed}}$ is extracted by running Marching Cubes. The SMPL-X tightness vectors and label maps are aggregated into body markers $\hat{\mathbf{m}}$, from which we optimize a SMPL-X mesh $\mathbf{M}_{\mathrm{SMPL\text{-}X}}$.
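A minimal sketch of the geometry branch is given below, assuming the DPSR indicator grid has already been computed. The tiny residual 3D CNN here is only a stand-in for the actual $g_{\theta}$, and extract_mesh is an illustrative helper, not part of the released code.

# Minimal sketch of the geometry branch: refine an indicator grid with a
# small 3D CNN, then extract the clothed mesh with Marching Cubes.
# The DPSR step producing `chi` is assumed to have run already; this toy
# refiner is only a stand-in for HART's actual 3D-UNet g_theta.
import torch
import torch.nn as nn
from skimage import measure

class ToyGridRefiner(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, 3, padding=1),
        )

    def forward(self, chi: torch.Tensor) -> torch.Tensor:
        # Predict a residual so the refiner only has to fix occluded regions.
        return chi + self.net(chi)

def extract_mesh(chi_refined: torch.Tensor, level: float = 0.0):
    # chi_refined: (1, 1, D, H, W) indicator grid; zero crossing = surface.
    grid = chi_refined[0, 0].detach().cpu().numpy()
    verts, faces, normals, _ = measure.marching_cubes(grid, level=level)
    return verts, faces, normals

# Usage with a random grid, just to show the shapes involved.
refiner = ToyGridRefiner()
chi = torch.randn(1, 1, 64, 64, 64)          # stand-in for the DPSR output
verts, faces, _ = extract_mesh(refiner(chi))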
We show examples of clothed mesh reconstruction from 4 views on test subjects of the THuman 2.1 and 2K2K datasets, and compare our method with PuzzleAvatar, MAtCha, and VGGT finetuned on our training data. For better visualization of the VGGT results, we calculate normal maps from the predicted point maps and apply Screened Poisson surface reconstruction to obtain surface meshes. The input views are shown in the top-right corners.
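For reference, the snippet below reproduces this kind of baseline visualization with Open3D. It estimates normals on the merged point cloud (rather than from the image-space point maps) and runs Screened Poisson reconstruction; the camera center used to orient the normals, the neighborhood size, and the octree depth are assumptions, not the exact settings used for the figures.

# Minimal sketch: turn merged per-pixel point maps into a surface mesh via
# normal estimation + Screened Poisson reconstruction (Open3D).
# Parameters (knn, depth) are illustrative, not the paper's settings.
import numpy as np
import open3d as o3d

def poisson_mesh_from_points(points: np.ndarray,
                             camera_center: np.ndarray,
                             depth: int = 9) -> o3d.geometry.TriangleMesh:
    # points: (N, 3) array of 3D points merged from all views.
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)

    # Estimate normals from local neighborhoods, then flip them so they
    # point toward one of the cameras for a consistent orientation.
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamKNN(knn=30))
    pcd.orient_normals_towards_camera_location(camera_center)

    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=depth)

    # Trim low-density vertices, which Poisson tends to hallucinate far
    # from the input samples.
    keep = np.asarray(densities) > np.quantile(np.asarray(densities), 0.02)
    mesh.remove_vertices_by_mask(~keep)
    return mesh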
We show examples of SMPL-X estimation from 4 views on test subjects of the THuman 2.1 and 2K2K datasets, and compare our method with EasyMocap, Multi-view SMPLify-X, and ETCH. ETCH is applied as a post-processing step to our clothed mesh predictions. The input views are shown in the top-right corners.
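For intuition, the sketch below shows a heavily simplified marker-based SMPL-X fit using the public smplx package. The marker-to-vertex correspondences, model path, loss weights, and optimizer settings are placeholders and do not reflect our actual fitting objective or that of the baselines.

# Minimal sketch: fit SMPL-X pose/shape to predicted body markers with the
# public `smplx` package. Correspondences, weights, and iteration counts are
# placeholders; `model_path` must point to locally downloaded SMPL-X files.
import torch
import smplx

def fit_smplx_to_markers(markers: torch.Tensor,            # (M, 3) aggregated markers
                         marker_vertex_ids: torch.Tensor,  # (M,) SMPL-X vertex ids
                         model_path: str = "models/smplx", # placeholder path
                         iters: int = 300):
    body = smplx.create(model_path, model_type="smplx", gender="neutral",
                        use_pca=False)

    # Free variables: global orientation, translation, body pose, shape.
    params = {
        "global_orient": torch.zeros(1, 3, requires_grad=True),
        "transl": torch.zeros(1, 3, requires_grad=True),
        "body_pose": torch.zeros(1, 63, requires_grad=True),
        "betas": torch.zeros(1, 10, requires_grad=True),
    }
    opt = torch.optim.Adam(params.values(), lr=0.02)

    for _ in range(iters):
        opt.zero_grad()
        out = body(return_verts=True, **params)
        pred = out.vertices[0, marker_vertex_ids]            # model points at markers
        loss = ((pred - markers) ** 2).sum(-1).mean()        # simple L2 data term
        loss = loss + 1e-3 * (params["betas"] ** 2).mean()   # weak shape prior
        loss.backward()
        opt.step()

    return body(return_verts=True, **params)                 # fitted SMPL-X output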
We present novel view synthesis results on the DNA-Rendering dataset with real-world human captures under 4-, 6-, and 8-view settings, comparing against MAtCha, the strongest competing baseline.
The authors thank Sanghyun Son and Xijun Wang for the fruitful discussions, and Jianyuan Wang for addressing technical questions about VGGT. This research is supported in part by the Dr. Barry Mersky E-Nnovate Endowed Professorship, the Capital One E-Nnovate Endowed Professorship, and Dolby Labs.
@article{chen2025hart,
title={HART: Human Aligned Reconstruction Transformer},
author={Chen, Xiyi and Wang, Shaofei and Mihajlovic, Marko and Kang, Taewon and Prokudin, Sergey and Lin, Ming},
journal={arXiv preprint arXiv:2509.26621},
year={2025}
}