Show HN: Run TRELLIS.2 Image-to-3D generation natively on Apple Silicon

Shivam Kumar ported Microsoft’s TRELLIS.2, a 4-billion-parameter model for generating 3D meshes from single images, to run natively on Apple Silicon. This eliminates the need for NVIDIA CUDA hardware, cloud services, or specialized dependencies. On an M4 Pro with 24GB unified memory, it produces meshes with around 400,000 vertices in 3.5 minutes—slower than the seconds it takes on an H100 GPU, but fully offline and local.

TRELLIS.2 comes from Microsoft Research, building on sparse transformer architectures for efficient 3D reconstruction. Released in late 2024, it handles complex geometry from casual photos, outperforming earlier models like InstantMesh or SV3D in detail and speed on high-end GPUs. The catch: it demands CUDA, flash-attention for sparse ops, nvdiffrast for rasterization, and custom sparse convolutions—all NVIDIA-locked. Macs with M-series chips couldn’t touch it without heroic engineering.

The Port: Pure PyTorch Swaps

Kumar replaced the CUDA-only components with plain PyTorch equivalents. He implemented a gather-scatter mechanism for the sparse 3D convolutions, used PyTorch’s scaled dot-product attention (SDPA) in the sparse transformers in place of flash-attention, and rewrote mesh extraction in pure Python, dropping the CUDA hashmaps. The changes span just a few hundred lines across nine files in the trellis-mac repo.
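To make the gather-scatter idea concrete, here is a minimal, dependency-free sketch. It is not code from the trellis-mac repo: the function name `sparse_conv3d`, the dict-based weight layout, and the choice to emit outputs only at already-active voxels are all assumptions made for illustration. In PyTorch the inner loops would typically become an index gather, a matrix multiply, and a scatter-add such as `index_add_`.

```python
from itertools import product


def sparse_conv3d(coords, feats, weights, kernel_size=3):
    """Sparse 3D convolution (cross-correlation form) via gather-scatter.

    coords:  list of (x, y, z) integer tuples, one per active voxel
    feats:   list of feature vectors, feats[i] belongs to coords[i]
    weights: dict mapping a kernel offset (dx, dy, dz) to a c_in x c_out matrix
             (hypothetical layout chosen for this sketch)
    Outputs are computed only at the input's active voxels.
    """
    # A plain dict plays the role of the coordinate hashmap.
    index_of = {c: i for i, c in enumerate(coords)}
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in coords]
    r = kernel_size // 2

    for offset in product(range(-r, r + 1), repeat=3):
        w = weights.get(offset)
        if w is None:
            continue  # no weight stored for this offset
        # Pair each output voxel with the active input voxel at this offset.
        pairs = []
        for out_idx, (x, y, z) in enumerate(coords):
            src = (x + offset[0], y + offset[1], z + offset[2])
            in_idx = index_of.get(src)
            if in_idx is not None:
                pairs.append((in_idx, out_idx))
        # Gather the input features, apply the weight, scatter-add the result.
        for in_idx, out_idx in pairs:
            f = feats[in_idx]
            for j in range(c_out):
                out[out_idx][j] += sum(f[i] * w[i][j] for i in range(len(f)))
    return out


coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
feats = [[1.0], [2.0], [3.0]]
weights = {(0, 0, 0): [[1.0]], (1, 0, 0): [[10.0]]}
print(sparse_conv3d(coords, feats, weights))  # [[21.0], [2.0], [3.0]]
```

Only voxel (0, 0, 0) has an active neighbor at offset (1, 0, 0), so it alone picks up the extra contribution; empty space is never visited, which is the point of the sparse formulation.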

Installation is straightforward for PyTorch users on macOS. Clone the repo, install dependencies via pip—including torch with MPS support—and download the 4B model weights (about 8GB quantized). Run inference with a single image:

python generate.py --input_image path/to/photo.jpg --output_dir ./outputs

This leverages Apple’s Metal Performance Shaders (MPS) backend in PyTorch 2.5+, which maps tensor ops to the GPU efficiently. No Docker hacks or remote servers required. Early tests show mesh quality holds up visually, though subtle artifacts might appear from PyTorch approximations versus optimized CUDA kernels.
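Before running the pipeline, it can help to confirm that PyTorch actually sees the Metal backend. The helper below is a hypothetical sketch, not part of the repo; `pick_device` is a name invented here, while `torch.backends.mps.is_available()` and `torch.cuda.is_available()` are standard PyTorch calls.

```python
def pick_device():
    """Return the best available PyTorch device name: "mps", "cuda", or "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate

    # Older PyTorch builds may not ship the MPS backend module at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```

If this prints `cpu` on an M-series Mac, the installed PyTorch build most likely lacks MPS support and inference will fall back to the CPU.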

Performance Breakdown

On the M4 Pro (14-core CPU, 20-core GPU, 24GB unified memory), the full pipeline, from image encoding to mesh export, takes 3.5 minutes for a 400K-vertex output. Peak memory usage hovers at 20GB, which fits snugly in higher-end MacBooks or Mac minis. For comparison, an H100 with 80GB of HBM3 finishes in under 30 seconds, thanks to tensor cores and custom kernels.
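The numbers above reduce to a few lines of arithmetic. This sketch uses only the figures quoted in this article; treating the H100's "under 30 seconds" as exactly 30 seconds is an assumption that makes the slowdown a lower bound.

```python
m4_seconds = 3.5 * 60    # 3.5 minutes on the M4 Pro
h100_seconds = 30        # assumed upper bound for "under 30 seconds"
vertices = 400_000
total_memory_gb = 24
peak_memory_gb = 20

slowdown_lower_bound = m4_seconds / h100_seconds   # 7.0
vertices_per_second = vertices / m4_seconds        # about 1905
headroom_gb = total_memory_gb - peak_memory_gb     # 4

print(f"slowdown: at least {slowdown_lower_bound:.0f}x")
print(f"throughput: about {vertices_per_second:.0f} vertices/s")
print(f"memory headroom: {headroom_gb} GB")
```

So the Mac is at least 7x slower than the H100 by these figures, and roughly 4GB of the 24GB remains for the OS and other applications at peak.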

Lower-spec machines struggle. An M3 Air with 16GB is limited to smaller meshes or crashes when it runs out of unified memory (Apple Silicon has no separate VRAM). Inference time scales linearly with batch size, but single-image use dominates. Per the repo’s benchmarks, 4-bit quantization shaves the time further to ~2.5 minutes without a visible quality drop.
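A back-of-the-envelope estimate shows why precision matters on memory-constrained machines. The sketch below counts weights only, ignoring activations, sparse-structure buffers, and framework overhead; `weight_gb` is a name invented here, and the 4-billion parameter count is the figure quoted in this article.

```python
def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 4_000_000_000  # 4B parameters, as quoted above

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gb(N_PARAMS, bits):.1f} GB")
```

By this estimate, 16-bit weights occupy about 8 GB and 4-bit weights about 2 GB. Actual file sizes and peak usage depend on how the checkpoint is packaged and on everything the estimate leaves out.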

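Skepticism check: ports like this can introduce numerical instability. TRELLIS.2’s sparse ops are finicky, and PyTorch’s SDPA may be less efficient on sparse inputs than the custom CUDA kernels, which would help explain the slowdown (at least 7x by the figures above: 3.5 minutes versus under 30 seconds). No public ablation studies exist yet, so run your own A/B tests on the repo.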
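One way to run such an A/B test is to compare point samples from a mesh generated on each backend. Vertex order and count will differ between runs, so a set-based metric such as the Chamfer distance is a reasonable choice. This is a minimal brute-force sketch, assuming you have already extracted points from both meshes as coordinate tuples; the function name is illustrative and nothing here comes from the repo.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point lists.

    Brute force, O(len(a) * len(b)): fine for a few thousand sampled points,
    too slow for a full 400K-vertex mesh without subsampling or a spatial index.
    """
    def one_way(src, dst):
        # Mean distance from each point in src to its nearest neighbor in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ported = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]
print(chamfer_distance(reference, reference))  # 0.0
print(chamfer_distance(reference, ported))     # 1.0
```

A value near zero relative to the mesh's bounding-box size suggests the two backends agree geometrically. What counts as "near" is a judgment call for your use case.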

Why This Matters

Image-to-3D tech unlocks product visualization, AR/VR assets, and custom 3D printing from phone snaps. Previously, you either paid for cloud GPUs on AWS or GCP ($2-5/hour for A100s) or waited in hosted-service queues. Now anyone with a 2024 MacBook Pro can iterate privately, without sending images to third-party services such as OpenAI or Replicate.

Privacy wins big: you can generate sensitive prototypes, such as hardware designs or crypto wallet mockups, without uploading them to black-box APIs. Security angle: offline ML avoids the supply-chain risks of cloud-hosted models. Cost: the marginal cost per generation is effectively $0 after the hardware purchase; a 24GB M4 Mac runs $2,500, which professionals could recoup in months compared with cloud GPU rental.

Broader ripple: Accelerates open-source 3D on consumer silicon. Expect forks for M-series diffusion models or Llama fine-tunes. Limits persist—no multi-view support yet, and topology can glitch on occluded scenes. Still, this cracks the NVIDIA moat, proving PyTorch MPS maturity for 4B-scale beasts. Grab the repo; test it yourself before hype swells.
