S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation
Preprint
We propose an interleaved multi-resolution diffusion transformer architecture that departs from conventional hourglass designs. By distributing high- and low-resolution blocks in a mixed topology, S2DiT improves memory efficiency and temporal coherence for streaming video generation under constrained compute budgets.
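To make the contrast with an hourglass layout concrete, the following minimal sketch builds two block schedules of the same depth and compares their average per-block token load. It is an illustration only, not the S2DiT layout: the function names, the `period` parameter, the quarter-depth split for the hourglass, and the 4x token reduction for low-resolution blocks are all assumptions introduced here, since the text above does not specify them. Token count is also only a rough proxy for memory and compute.

```python
from typing import List


def hourglass_schedule(depth: int) -> List[str]:
    """Hourglass-style layout (assumed split): high-resolution blocks at
    both ends, low-resolution blocks grouped in the middle."""
    outer = depth // 4
    return ["high"] * outer + ["low"] * (depth - 2 * outer) + ["high"] * outer


def interleaved_schedule(depth: int, period: int = 3) -> List[str]:
    """Hypothetical interleaved layout: one high-resolution block every
    `period` blocks, low-resolution blocks in between."""
    return ["high" if i % period == 0 else "low" for i in range(depth)]


def relative_token_cost(schedule: List[str], downsample: int = 4) -> float:
    """Average tokens processed per block, relative to an all-high-resolution
    stack. A low-resolution block is assumed to see 1/downsample of the tokens."""
    per_block = {"high": 1.0, "low": 1.0 / downsample}
    return sum(per_block[r] for r in schedule) / len(schedule)


if __name__ == "__main__":
    for name, schedule in [
        ("hourglass", hourglass_schedule(12)),
        ("interleaved", interleaved_schedule(12)),
    ]:
        print(name, schedule, round(relative_token_cost(schedule), 3))
```

In this toy setting the two schedules can be given a similar token budget while differing in where the high-resolution blocks sit, which is the design axis the interleaved topology varies.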