Stop Reading, Start Seeing: this may solve a long-standing problem in AI
The great capabilities of LLMs come with enormous energy and hardware costs, partly due to one inherent weakness: when processing text, the computational cost and memory requirements grow quadratically with text length. This quadratic scaling is particularly limiting when LLMs are employed for long, complex, and potentially multi-step tasks.
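A back-of-the-envelope sketch makes the quadratic scaling concrete. The function below simply counts pairwise attention scores; the constant factors of a real transformer are ignored, only the growth rate matters:

```python
# Back-of-the-envelope illustration of quadratic attention scaling.
# Real transformers have many constant factors; only the growth rate matters here.

def attention_score_count(n_tokens: int) -> int:
    """Number of pairwise attention scores for a sequence of n tokens."""
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_count(n):>18,} pairwise scores")
```

Going from 10,000 to 100,000 tokens multiplies the work not by 10 but by 100, which is exactly why long contexts are so expensive.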
To tackle this issue, researchers at DeepSeek-AI have proposed a radical new approach: instead of forcing an AI to "read" every single word in a long document, what if we could turn the entire text into a picture and have the AI "see" it instead?
The core idea behind "contexts optical compression" is simple yet powerful. The DeepSeek-OCR model takes long pages of text, renders them as a single image, and then uses a highly efficient vision encoder to represent that image with a small number of "vision tokens." A language model can then decode these vision tokens to reconstruct the original text.
The compression is remarkably effective. At a compression ratio of nearly 10-to-1 (for example, representing 600-700 text tokens with just 64 vision tokens), the model can reconstruct the original text with 96.5% precision. Higher compression ratios (e.g. 20-to-1) trade accuracy away, dropping it to around 60%.
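To get a feel for the budgeting this implies, here is a small helper (my own illustration, not part of DeepSeek-OCR) that estimates how many vision tokens a passage needs at a target compression ratio:

```python
import math

def vision_tokens_needed(text_tokens: int, ratio: float) -> int:
    """Vision tokens required to represent `text_tokens` at a given
    compression ratio (e.g. ratio=10 means ~10 text tokens per vision token)."""
    return math.ceil(text_tokens / ratio)

# The article's ~10x example: 640 text tokens fit in 64 vision tokens.
print(vision_tokens_needed(640, 10))   # 64
# Pushing to 20x halves the vision-token budget, but precision drops.
print(vision_tokens_needed(640, 20))   # 32
```

The practical takeaway is that the ratio is a dial: the same page can be stored cheaply at low fidelity or expensively at near-lossless fidelity.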
One of the biggest technical hurdles for vision models is processing high-resolution images, which consumes massive amounts of memory. DeepSeek-AI anticipated this, too, and developed a novel architecture called the "DeepEncoder."
The DeepEncoder cleverly combines two different components. The first part, using window attention, focuses on local details within the image, much like reading words in a small area. The second part, which uses global attention, is designed to see the "big picture" and understand the overall context. The true innovation, however, lies in what connects them: a 16x token compressor. This module sits between the two components and dramatically reduces the amount of data that the most memory-intensive part has to handle. This smart design allows the model to process high-resolution documents with incredible detail without overloading its memory.
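The token flow described above can be traced with a schematic sketch. The function and its dimensions are illustrative assumptions, not the paper's exact configuration; the point is to show why compressing *before* the global-attention stage pays off so dramatically:

```python
# Schematic sketch of the DeepEncoder token flow (illustrative numbers,
# not the paper's exact configuration).

def deepencoder_token_flow(image_patches: int, compress_factor: int = 16) -> dict:
    """Trace how many tokens each stage handles.

    Stage 1 (window attention) sees all patch tokens but only attends locally,
    so its cost stays manageable. The 16x token compressor then shrinks the
    sequence before Stage 2 (global attention), whose quadratic cost now
    applies to a much shorter sequence.
    """
    compressed = image_patches // compress_factor
    return {
        "window_attention_tokens": image_patches,
        "global_attention_tokens": compressed,
        # Ratio of quadratic global-attention cost without vs. with compression:
        "global_cost_saving": (image_patches ** 2) / (compressed ** 2),
    }

flow = deepencoder_token_flow(4096)
print(flow)  # 4096 patch tokens -> 256 after compression; 256.0x cheaper global attention
```

With a 16x compressor, the quadratic stage becomes 16² = 256 times cheaper, which is what lets the model keep high-resolution detail without blowing up memory.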
Another profound implication discussed in the paper is the method's potential to create a more natural and efficient memory system for AI. The researchers propose that optical compression could be used to simulate a form of memory decay, much like how human memory works.
The proposed mechanism is elegant. Recent conversations or newly added information could be stored as high-resolution images, keeping them crisp and easily accessible. As information ages, the images could be progressively downsized or blurred: the older the memory, the lower the resolution. In this way, contexts optical compression enables a form of memory decay that mirrors biological forgetting, where recent information retains high fidelity while distant memories fade through ever-higher compression ratios.
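One way such a decay schedule could look is sketched below. This is my own illustration under assumed parameters (the paper does not specify a schedule): the rendering resolution halves every few conversation turns, down to some floor, so older memories occupy fewer vision tokens:

```python
# Illustrative resolution-decay schedule (my own sketch, not the paper's
# implementation). Older memories are re-rendered at a smaller resolution,
# hence fewer vision tokens and a coarser, "faded" reconstruction.

def render_resolution(age_in_turns: int, base: int = 1024,
                      halve_every: int = 8, floor: int = 64) -> int:
    """Target image side length for a memory of the given age."""
    halvings = age_in_turns // halve_every
    return max(floor, base >> halvings)

for age in (0, 8, 16, 32):
    print(f"age {age:>2} turns -> {render_resolution(age)}px")
# age 0 -> 1024px, age 8 -> 512px, age 16 -> 256px, age 32 -> 64px (floor)
```

The floor value keeps even very old memories minimally retrievable, while fresh context stays at full fidelity.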
Original article: https://arxiv.org/abs/2510.18234