# PAct Reproduction Notes in `trellis2`

Date: 2026-04-18 (UTC)

## 1. Paper and code alignment

After reading the paper alongside the released inference code, I found that the high-level pipeline in the code matches the paper:

- Stage 1 predicts a part-decomposed sparse structure from a single image.
- Stage 1 uses part-aware conditioning: image features from DINOv2 plus semantic part masks.
- The Stage 1 transformer alternates global attention and within-part local attention.
- Stage 2 uses the Stage 1 structure as conditioning to generate part-level structured latents and decode them into 3D part geometry/appearance.
- Articulation is predicted from Stage 2 denoiser features, not just Stage 1 features.
- The default released inference settings match the paper:
  - `ss_steps=25`
  - `slat_steps=25`
  - `ss_cfg_strength=7.0`
  - `slat_cfg_strength=7.0`
  - articulation predicted from Stage 2 features averaged over the last 20 denoising steps
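
"Within-part local attention" can be pictured as a boolean attention mask in which a token may only attend to tokens carrying the same part label, while global layers use a full mask. The sketch below is a minimal NumPy illustration under assumptions: the per-token `part_ids` array, the helper names, and the even/odd alternation schedule are hypothetical and not taken from the PAct code.

```python
import numpy as np


def attention_masks(part_ids: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (global_mask, local_mask) as boolean [N, N] arrays.

    part_ids: integer array of shape [N] holding the part label of each
    sparse-structure token (hypothetical layout). True at [i, j] means
    token i may attend to token j.
    """
    n = part_ids.shape[0]
    global_mask = np.ones((n, n), dtype=bool)            # every token sees every token
    local_mask = part_ids[:, None] == part_ids[None, :]  # only tokens of the same part
    return global_mask, local_mask


def mask_for_layer(layer_idx: int, global_mask: np.ndarray, local_mask: np.ndarray) -> np.ndarray:
    # Hypothetical schedule: even layers attend globally, odd layers within parts.
    return global_mask if layer_idx % 2 == 0 else local_mask


part_ids = np.array([0, 0, 1, 2, 2])
g, l = attention_masks(part_ids)
```

With these labels, token 0 can see token 1 (same part) in a local layer but not token 2.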
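
The articulation setting averages features over the tail of the sampling trajectory. With `slat_steps=25`, the last 20 steps are steps 5 through 24 (0-indexed). This is a minimal sketch assuming the per-step denoiser features are collected into a list of equally shaped arrays; the helper name is hypothetical, and the released code may accumulate features differently (for example, as a running sum).

```python
import numpy as np


def average_last_k(step_features: list[np.ndarray], k: int = 20) -> np.ndarray:
    """Average per-step feature arrays over the last k denoising steps.

    step_features[t] holds the features captured at sampling step t,
    where t = 0 is the first (noisiest) step. Hypothetical helper.
    """
    if k <= 0 or k > len(step_features):
        raise ValueError(f"k must be in [1, {len(step_features)}], got {k}")
    return np.mean(np.stack(step_features[-k:], axis=0), axis=0)


# Toy trajectory: the feature at step t is a constant vector filled with t.
feats = [np.full((3,), float(t)) for t in range(25)]
avg = average_last_k(feats, k=20)  # mean of steps 5..24
```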

## 2. Runtime environment used

- Conda env: `trellis2`
- Python: `3.10.19`
- PyTorch: `2.6.0+cu124`
- GPU tested: NVIDIA H100 80GB
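
To compare another machine against the versions above, a small report script can collect the same fields. This is a sketch: it only uses standard `platform` calls and the public `torch.__version__`, `torch.version.cuda`, and `torch.cuda.get_device_name` attributes, and it degrades gracefully when PyTorch is not installed.

```python
import platform


def environment_report() -> dict:
    """Collect the versions this note records: Python, PyTorch, CUDA build, GPU name."""
    report = {"python": platform.python_version(), "torch": None, "cuda": None, "gpu": None}
    try:
        import torch  # optional: absent in a bare interpreter
    except ImportError:
        return report
    report["torch"] = torch.__version__      # e.g. "2.6.0+cu124"
    report["cuda"] = torch.version.cuda      # CUDA version PyTorch was built against
    if torch.cuda.is_available():
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```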

## 3. Environment changes made

The following changes were made inside the `trellis2` conda environment during reproduction:

1. Installed the Gaussian rasterization extension required by PAct's renderer:

```bash
mkdir -p /tmp/extensions
git clone https://github.com/autonomousvision/mip-splatting.git /tmp/extensions/mip-splatting
python -m pip install /tmp/extensions/mip-splatting/submodules/diff-gaussian-rasterization/ --no-build-isolation
```

Installed package:

- `diff_gaussian_rasterization==0.0.0`

2. Installed a PDF parser, used only to read the local copy of the paper:

```bash
python -m pip install pypdf
```

Installed package:

- `pypdf==6.10.2`

Notes:

- `diffoctreerast` was not installed because the released inference path exercised here did not require it.
- First-time inference also downloads model assets from Hugging Face and DINOv2 weights from `torch.hub`; these are cached downloads, not package installs.
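
A quick way to confirm that both installs landed in the active environment is to check importability and installed version with the standard library. This is a sketch; the module-to-distribution name mapping in `REQUIRED` is an assumption based on the package names listed above.

```python
from __future__ import annotations

from importlib import metadata, util

# Import name -> pip distribution name (assumed to match the names recorded above).
REQUIRED = {
    "diff_gaussian_rasterization": "diff_gaussian_rasterization",
    "pypdf": "pypdf",
}


def check_packages(required: dict[str, str]) -> dict[str, str | None]:
    """Map each module to its installed version, or None if it cannot be imported."""
    status: dict[str, str | None] = {}
    for module_name, dist_name in required.items():
        if util.find_spec(module_name) is None:
            status[module_name] = None
            continue
        try:
            status[module_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under that name.
            status[module_name] = "importable (version unknown)"
    return status
```

Running `print(check_packages(REQUIRED))` inside `trellis2` should report the versions listed above; a `None` entry means the corresponding install step needs to be repeated.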

## 4. Issues encountered and fixes applied

### Issue A: EXR mask shape mismatch

Initial failure:

- `IndexError: boolean index did not match indexed array along dimension 0`

Cause:

- Sample `.exr` masks were loaded as shape `(1, 518, 518, 3)`, while the code assumed `(518, 518, C)`.

Fix:

- In `modules/pact/datasets/components.py`, squeeze singleton dimensions before extracting the semantic mask channel.
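
The shape normalization can be sketched as follows. This is an illustration, not the exact patch: the note only says singleton dimensions are squeezed, and this variant deliberately peels off *leading* size-1 axes so that a single-channel `(H, W, 1)` mask keeps its channel axis (plain `np.squeeze` would drop it). Which channel carries the semantic labels is an assumption (channel 0 here).

```python
import numpy as np


def normalize_mask_shape(mask: np.ndarray) -> np.ndarray:
    """Return an EXR mask as (H, W, C), dropping leading singleton axes."""
    # Only strip leading size-1 axes; np.squeeze() would also turn (H, W, 1) into (H, W).
    while mask.ndim > 3 and mask.shape[0] == 1:
        mask = mask[0]
    if mask.ndim == 2:  # single-channel mask stored without a channel axis
        mask = mask[..., None]
    if mask.ndim != 3:
        raise ValueError(f"expected a mask of shape (H, W, C), got {mask.shape}")
    return mask


mask = normalize_mask_shape(np.zeros((1, 518, 518, 3), dtype=np.float32))
semantic = mask[..., 0]  # assumption: channel 0 holds the semantic part labels
```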

### Issue B: `--data_dir` argument was ignored

Cause:

- `infer_imgs.py` accepted `--data_dir` but ignored it and always read inputs from the hard-coded path `assets/real_world_examples`.

Fix:

- In `infer_imgs.py`, replaced the hard-coded path with `cfg.data_dir`.
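
The fix amounts to reading the input directory from the parsed config instead of a literal. A minimal sketch, assuming the script parses arguments with `argparse`; `build_parser`, `input_dir`, and the default value shown are hypothetical and may differ from `infer_imgs.py`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumption: keeping the old hard-coded location as the default preserves prior behavior.
    parser.add_argument("--data_dir", type=str, default="assets/real_world_examples")
    parser.add_argument("--outdir", type=str, default="outputs")
    return parser


def input_dir(cfg: argparse.Namespace) -> Path:
    # Before the fix, the equivalent logic ignored cfg and always used
    # Path("assets/real_world_examples").
    return Path(cfg.data_dir)


cfg = build_parser().parse_args(["--data_dir", "/tmp/pact_single"])
```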

### Issue C: missing rendering extension

Failure:

- `ModuleNotFoundError: No module named 'diff_gaussian_rasterization'`

Fix:

- Installed `diff_gaussian_rasterization` from the `mip-splatting` submodule (see Section 3).

### Issue D: exploded-part video saved with image writer

Failure:

- `ValueError: Image must be 2D (grayscale, RGB, or RGBA).`

Cause:

- `exploded_video` is a sequence of frames, but the code saved it with `imageio.imwrite`, which writes a single image.

Fix:

- In `modules/pact/pipelines/pact_i23d_gen_pipe.py`, changed exploded-part saving to `imageio.mimsave(..., fps=video_fps)`.
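
The underlying rule is that `imageio.imwrite` takes one image while `imageio.mimsave` takes a sequence of frames. A small guard can pick the right writer from the data's shape. This is a hypothetical helper, not code from the pipeline (the actual fix just swaps the call); it assumes color frames, so a grayscale `(T, H, W)` stack would be misclassified, and the `fps` default is an assumption. Writing `.mp4` through `imageio` also requires the `imageio-ffmpeg` backend.

```python
import numpy as np


def is_frame_stack(frames) -> bool:
    """True for a list/tuple of frames or a (T, H, W, C) array (color frames assumed)."""
    if isinstance(frames, (list, tuple)):
        return True
    return np.asarray(frames).ndim == 4


def save_visual(path: str, frames, video_fps: int = 30) -> None:
    import imageio  # imported lazily so the shape check works without imageio installed

    if is_frame_stack(frames):
        imageio.mimsave(path, frames, fps=video_fps)  # multi-frame output (e.g. .mp4)
    else:
        imageio.imwrite(path, np.asarray(frames))     # single still image
```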

### Issue E: Pillow/WebP plugin crash during PNG saving

Failure:

- `AttributeError: module 'PIL._webp' has no attribute 'HAVE_WEBPANIM'`

Cause:

- The Pillow build installed in `trellis2` fails while initializing its WebP plugin; this breaks `imageio.imwrite` for PNG output as well as any other PIL-backed image saving.

Fix:

- In `modules/pact/pipelines/pact_i23d_gen_pipe.py`, replaced the affected PNG writes with `cv2.imwrite(...)`.
- This was applied to both per-sample conditioning PNGs and the conditioning grid save path.
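
Switching from a PIL-backed writer to OpenCV changes two conventions: `cv2.imwrite` expects BGR(A) channel order and an integer dtype, and it reports failure by returning `False` rather than raising. A sketch of the conversion, assuming the pipeline's images are RGB/RGBA arrays that are either `uint8` or floats in `[0, 1]`; the helper names are hypothetical.

```python
import numpy as np


def to_cv2_image(img: np.ndarray) -> np.ndarray:
    """Convert RGB/RGBA (uint8, or float in [0, 1]) to the BGR/BGRA uint8 layout cv2 expects."""
    if img.dtype != np.uint8:
        # Assumes float input in [0, 1]; values outside that range are clipped.
        img = (np.clip(img, 0.0, 1.0) * 255.0).round().astype(np.uint8)
    if img.ndim == 3 and img.shape[-1] == 3:
        img = img[..., ::-1]            # RGB -> BGR
    elif img.ndim == 3 and img.shape[-1] == 4:
        img = img[..., [2, 1, 0, 3]]    # RGBA -> BGRA
    return np.ascontiguousarray(img)


def save_png(path: str, img: np.ndarray) -> None:
    import cv2  # lazy import: OpenCV is only needed when actually writing

    # cv2.imwrite returns False on failure instead of raising, so check it explicitly.
    if not cv2.imwrite(path, to_cv2_image(img)):
        raise IOError(f"cv2.imwrite failed for {path}")
```

Without the channel swap the saved PNGs would still load, but with red and blue exchanged.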

## 5. Successful reproduction commands

### Single-sample end-to-end check

I created a temporary one-sample input folder and ran:

```bash
CUDA_VISIBLE_DEVICES=0 python infer_imgs.py \
  --data_dir /tmp/pact_single \
  --outdir outputs/repro_single_success_v3 \
  --batch_size 1 \
  --no-save_video_grid \
  --no-save_cond_vis_grid
```

Successful output directory:

- `outputs/repro_single_success_v3/seed42_slatcfg7.0_sscfg7.0_sssteps25_slatsteps25_artioutmean_feature_regression_steps`

Produced files:

- `00__articulation_animation.mp4`
- `00__exploded_part.mp4`
- `00__exploded_part.png`
- `run_command.txt`

### Full official example batch

The released official example set also completed successfully:

```bash
CUDA_VISIBLE_DEVICES=0 python infer_imgs.py \
  --data_dir assets/real_world_examples \
  --outdir outputs/repro_full_official \
  --batch_size 1 \
  --no-save_video_grid \
  --no-save_cond_vis_grid
```

Successful output directory:

- `outputs/repro_full_official/seed42_slatcfg7.0_sscfg7.0_sssteps25_slatsteps25_artioutmean_feature_regression_steps`
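
Both runs write into a subdirectory with the same name, which appears to encode the sampling settings. The function below is a hypothetical reconstruction inferred only from the observed names, useful for predicting where a run with different settings will land; in particular, splitting the tail into `artiout` plus `mean_feature_regression_steps` is a guess, not something confirmed in the source.

```python
def run_dir_name(seed: int, slat_cfg: float, ss_cfg: float,
                 ss_steps: int, slat_steps: int, arti_out: str) -> str:
    # Field order inferred from the observed directory name, not read from infer_imgs.py.
    return (f"seed{seed}_slatcfg{slat_cfg}_sscfg{ss_cfg}"
            f"_sssteps{ss_steps}_slatsteps{slat_steps}_artiout{arti_out}")


name = run_dir_name(42, 7.0, 7.0, 25, 25, "mean_feature_regression_steps")
```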

Observed result:

- All `22/22` batches completed successfully.
- Runtime was about `2m31s` on one H100.
- Output file counts:
  - `44` mp4 files
  - `22` png files
  - `1` txt file

With `--batch_size 1`, 22 batches means 22 samples, so these counts correspond to:

- one articulation animation mp4 per sample
- one exploded-part mp4 per sample
- one exploded-part png per sample
- one `run_command.txt`
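
The expected layout (two mp4 files and one png per sample, plus one `run_command.txt`) can be checked automatically after a run. A standard-library sketch; the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def check_outputs(run_dir: Path, num_samples: int) -> Counter:
    """Count files by extension and compare against the per-sample layout described above."""
    counts = Counter(p.suffix.lower() for p in run_dir.iterdir() if p.is_file())
    expected = {".mp4": 2 * num_samples, ".png": num_samples, ".txt": 1}
    for ext, n in expected.items():
        if counts[ext] != n:
            raise ValueError(f"{ext}: expected {n} file(s), found {counts[ext]}")
    return counts
```

For the full official batch, `check_outputs(Path("outputs/repro_full_official/seed42_slatcfg7.0_sscfg7.0_sssteps25_slatsteps25_artioutmean_feature_regression_steps"), 22)` should pass with 44 mp4, 22 png, and 1 txt file.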

## 6. Source files modified during reproduction

Only the following source files were intentionally modified:

- `infer_imgs.py`
- `modules/pact/datasets/components.py`
- `modules/pact/pipelines/pact_i23d_gen_pipe.py`

## 7. Extra notes

- The negated boolean CLI flags in this script use underscores, not hyphens:
  - valid: `--no-save_video_grid`
  - valid: `--no-save_cond_vis_grid`
  - invalid: `--no-save-video-grid`
- This reproduction covered only the released inference path; training, mesh export, and URDF export were not validated in this pass.
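
The underscore behavior is what Python's `argparse.BooleanOptionalAction` (Python 3.9+) produces: it registers the negated flag as `--no-` followed by the original option name with its leading `--` removed, so underscores are preserved. I have not confirmed that `infer_imgs.py` uses this action; the sketch below, including the `default=True` values, is an assumption that reproduces the observed flags.

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction registers both --save_video_grid and --no-save_video_grid.
parser.add_argument("--save_video_grid", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--save_cond_vis_grid", action=argparse.BooleanOptionalAction, default=True)

args = parser.parse_args(["--no-save_video_grid", "--no-save_cond_vis_grid"])
print(args.save_video_grid, args.save_cond_vis_grid)  # False False

# The hyphenated spelling "--no-save-video-grid" is not a registered option,
# so argparse rejects it as an unrecognized argument and exits.
```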
