The neural network in the retina actually pre-processes visual information into something akin to "tokens": basic shapes that are probably somewhat evolutionarily conserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely someone out there is already trying.
AFAIK this is actually a separate mechanism, one that is part of the visual cortex and not the retina. Essentially, recognizing even a single object requires the complete attention of pretty much your entire brain in the moment of recognition.
What I am referring to is a much more basic form of shape recognition that goes on at the level of the neural networks in the retina.
(Source: "The Mind Is Flat" by Nick Chater)