T5 Encoder is ~5B parameters so back of the envelope would be ~10GB of VRAM (it'...

storystarling · 2026-01-23T07:48:47 1769154527

The 5B text encoder feels disproportionate for a 2B video model. If the text portion is dominating your VRAM usage it really hurts the inference economics.

Have you tried quantizing the T5? In my experience you can usually run these encoders in 8-bit or even 4-bit with negligible quality loss. Dropping that memory footprint would make this much more viable for consumer hardware.

schopra909 · 2026-01-23T14:49:31 1769179771

That all being said, you can just delete the T5 from memory after encoding the text so save on memory.

The 2B parameters will take up 4 Gb of memory but activations will be a lot more given size of context windows for video.

A 720p 5 second video is roughly 100K tokens of context

schopra909 · 2026-01-23T14:47:08 1769179628

Great idea! We haven’t tried it but def interested to see if that works as well.

When we started down this path, T5 was the standard (back in 2024).

Likely won’t be the text encoder for subsequent models, given its size (per your point) and age