SpheRoPE: Zero-Shot Optimization-Free 360° Panorama Generation with Spherical RoPE

Abstract

We present a zero-shot, training-free and optimization-free framework for generating 360° panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training; however, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact 2π periodicity. Coupled with a novel Semantic Distortion CFG that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach is versatile and can be applied to diverse backbones and 360° generation modalities. We demonstrate this across text-to-panorama and image-to-panorama tasks using Flux.1, Flux.2, and LTX-Video backbones.

Highlights

Zero-Shot, Training-Free & Optimization-Free

A single framework for 360° image and video generation that injects spherical priors into pre-trained Diffusion Transformers at inference time — no fine-tuning, no per-sample optimization, no architectural changes.

Backbone- & Modality-Agnostic

The same two test-time components plug into Flux.1, Flux.2, and LTX-Video, covering text-to-panorama and image-to-panorama tasks without retraining, and inherit the full creative breadth of each base model.

Method

Two orthogonal, test-time-only components that inject spherical geometry into pre-trained diffusion transformers.

PCA visualization of RoPE. (a) Linear RoPE creates a seam at the ±π meridian (top row) and disjoint polar embeddings (spheres, left panel). (b) Our Spherical RoPE wraps seamlessly around the horizon and converges smoothly at the poles.

Spherical RoPE Positional Encoding

A valid ERP panorama must satisfy two topological constraints that standard linear RoPE fundamentally violates:

C1. Horizontal periodicity — the left and right boundaries are the same meridian and must be continuous.
C2. Polar convergence — all columns collapse to a single point at each pole.
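
To make C1 concrete, the sketch below measures the phase gap a linear rotary channel leaves between column 0 and column W, which are the same meridian on the sphere. It is an illustration only: the helper name `seam_phase_gap`, the width of 64 tokens, and the RoPE base of 10000 are assumptions chosen for the example and are not taken from the method.

```python
import numpy as np

def seam_phase_gap(freqs, width):
    """Phase offset, wrapped to (-pi, pi], between column 0 and column `width`.

    A channel with angular frequency f (radians per token) has phase f * x at
    column x. Columns 0 and `width` are the same meridian, so a nonzero wrapped
    gap means the embedding is discontinuous at the seam.
    """
    return np.angle(np.exp(1j * np.asarray(freqs) * width))

width = 64                                        # assumed token width
linear_freqs = 10000.0 ** (-np.arange(8) / 8)     # assumed geometric frequency ladder
linear_gaps = seam_phase_gap(linear_freqs, width)

# A frequency that completes exactly 3 cycles across the width closes the loop.
harmonic_gap = seam_phase_gap([2.0 * np.pi * 3 / width], width)
```

The entries of `linear_gaps` are generally nonzero, whereas `harmonic_gap` vanishes up to floating-point error, because only frequencies that complete a whole number of cycles across the width return to their starting phase.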

Instead of one uniform fix, we partition the width-axis RoPE channels by their harmonic alignment with the image width and treat each band according to its role:

Low-frequency channels act as a global compass. We re-parameterize them as Cartesian coordinates on the unit sphere so that, as longitude wraps, the embedding traces a closed loop (satisfying C1), and at the poles all columns converge to a single point (satisfying C2).
High-frequency channels govern local texture. We keep them linear but snap each to the nearest integer harmonic of the width, enforcing exact 2π periodicity and eliminating seams without disturbing the pretrained local prior.
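
The two bands can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the RoPE base of 10000, and the rule that a channel counts as low-frequency when it completes less than one cycle across the width are all choices made for this example. How the Cartesian coordinates are fed into the rotary rotation is not specified above and is left out.

```python
import numpy as np

def rope_frequencies(num_channels, base=10000.0):
    """Geometric RoPE frequency ladder in radians per token (base is assumed)."""
    k = np.arange(num_channels)
    return base ** (-k / num_channels)

def is_low_frequency(freqs, width):
    """Assumed split: 'low' means less than one full cycle across the width."""
    return freqs * width / (2.0 * np.pi) < 1.0

def snap_to_harmonics(freqs, width):
    """Move each frequency to the nearest integer number of cycles (at least 1)."""
    cycles = freqs * width / (2.0 * np.pi)
    harmonics = np.maximum(1.0, np.rint(cycles))
    return 2.0 * np.pi * harmonics / width

def sphere_coords(height, width):
    """Unit-sphere Cartesian coordinates for an ERP grid, shape (height, width, 3).

    Rows run from the north pole to the south pole; columns cover one full
    turn of longitude, so column `width` would coincide with column 0.
    """
    lat = np.linspace(np.pi / 2, -np.pi / 2, height)
    lon = 2.0 * np.pi * np.arange(width) / width
    lat, lon = np.meshgrid(lat, lon, indexing="ij")
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )

width = 64
freqs = rope_frequencies(16)
high = freqs[~is_low_frequency(freqs, width)]   # channels kept linear
snapped = snap_to_harmonics(high, width)        # exact 2*pi periodicity (C1)
xyz = sphere_coords(height=32, width=width)     # stand-in for the low band (C1 and C2)
```

After snapping, `np.exp(1j * snapped * width)` equals 1 for every channel up to floating-point error, so the phase at column `width` matches column 0. In `xyz`, the first and last rows are constant across all columns because the cosine of the latitude vanishes at the poles, which is the convergence that C2 asks for.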

Semantic Distortion CFG Guidance

Pre-trained diffusion models already exhibit weak ERP behavior (polar stretching, horizon curvature) when prompted for 360° scenes. To amplify that latent prior and complement the hard geometry from Spherical RoPE, we extend classifier-free guidance to a three-way formulation:

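$$\hat{\epsilon} = \epsilon_{\text{uncond}} + w_{\text{sem}} \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) + \gamma \cdot (\epsilon_{\text{geo}} - \epsilon_{\text{cond}})$$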

The geometric term uses an anchored prompt — the user prompt concatenated with a geometric ERP description — so the difference ε_geo−ε_cond isolates pure projection geometry, orthogonal to semantic content. The scales w_sem and γ can be tuned independently; setting γ = 0 cleanly recovers standard CFG.
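
The three-way combination can be written in a few lines. The sketch below assumes the three model outputs are already available as arrays of equal shape; the function names and the example ERP description string are hypothetical and only stand in for whatever the backbone and prompt template actually use.

```python
import numpy as np

def anchored_prompt(user_prompt, erp_description):
    """Build the geometric prompt by appending an ERP description (assumed format)."""
    return f"{user_prompt} {erp_description}"

def semantic_distortion_cfg(eps_uncond, eps_cond, eps_geo, w_sem, gamma):
    """Combine unconditional, conditional and geometry-anchored predictions.

    eps_uncond: prediction for the empty prompt
    eps_cond:   prediction for the user prompt
    eps_geo:    prediction for the anchored prompt
    """
    semantic = w_sem * (eps_cond - eps_uncond)    # standard CFG direction
    geometric = gamma * (eps_geo - eps_cond)      # projection-geometry direction
    return eps_uncond + semantic + geometric

# Toy stand-ins for the three predictions at one denoising step.
rng = np.random.default_rng(0)
eps_u, eps_c, eps_g = (rng.standard_normal((4, 8)) for _ in range(3))

# Hypothetical ERP description, for illustration only.
geo_prompt = anchored_prompt("a mountain lake at dawn", "360 degree equirectangular panorama")

guided = semantic_distortion_cfg(eps_u, eps_c, eps_g, w_sem=3.5, gamma=2.0)
plain_cfg = semantic_distortion_cfg(eps_u, eps_c, eps_g, w_sem=3.5, gamma=0.0)
```

With `gamma=0.0` the geometric term drops out and `plain_cfg` equals `eps_u + 3.5 * (eps_c - eps_u)`, the usual two-way guidance. The scale values 3.5 and 2.0 are arbitrary placeholders, not recommended settings.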

360° Video Comparison

[Interactive 360° video comparison: Ours (LTX 2.3, with audio) vs. SphereDiff (no audio).]

* Prompts adapted per method to match expected prompting style. See Supplementary.

BibTeX

@article{SpheRoPE2026,
  title={SpheRoPE: Zero-Shot Optimization-Free 360° Panorama Generation with Spherical RoPE},
  author={Or Hirschorn and Aaron Olender and Eli Alshan and Ianir Ideses and Lior Fritz and Sagie Benaim},
  year={TBD},
  journal={TBD},
  eprint={TBD},
  archivePrefix={arXiv},
  primaryClass={TBD},
  url={TBD}
}
