In order to "uderstand" MNIST dataset (i.e. correctly predict test set) it turns out that it is enough to focus on the small amount of values, or more precisely on those that are very common. Let us explain it in more details. There are 6000 samples in the train set, consisting of 784=28*28 pixels for each image. If we fix one of the 784 pixels, we can get a typical histogram of its values 0-255 (cut by 1000 on the y axis for presentation purposes):
Here one can clearly see that some values are "discrete" (i.e. they form spikes) and do not blend in with their immediate neighbors (i.e. their counts are far from those of nearby values). To be more precise, if we decide to create two discrete classes, low for the value zero and high for the values 251-255, we get the following samples:
Here:
black marks unassigned values,
grey marks low-class values, and
white marks high-class values.
It is visually obvious that the digits can still be recognized well from these "modified" pictures, so there is no need to store the full information with values in the range [0, 255].
In fact, it is enough to use just 3 values: 1 for zeros, 2 for the values 251-255, and 0 for everything else.
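A minimal sketch of this 3-value encoding (the function name quantize and the thresholds written as array masks are ours, not from the source):

```python
import numpy as np

def quantize(images: np.ndarray) -> np.ndarray:
    """Encode raw intensities 0-255 with the three codes above:
    1 for exact zeros, 2 for the high class 251-255, 0 for the rest."""
    out = np.zeros_like(images, dtype=np.uint8)  # "the rest" -> 0
    out[images == 0] = 1                         # low class (zeros) -> 1
    out[images >= 251] = 2                       # high class (251-255) -> 2
    return out

X_quant = quantize(X_train)  # e.g. the (60000, 28, 28) array from above
```

Displaying the result with 0 as black, 1 as grey and 2 as white reproduces the "modified" pictures shown above.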
The same exercise was applied to other common (and simple) datasets: FashionMNIST and the MedMNIST datasets. The results are taken from here.