Introduction
Layman's explanation
Technical explanation
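Machine learning and compression are closely connected. A system that predicts the probability distribution of the next symbol in a sequence, given the entire preceding history, can be used for optimal data compression: feeding those predicted probabilities to an arithmetic coder spends about -log2(p) bits on a symbol that was predicted with probability p, so better predictions give shorter output. Conversely, an optimal compressor can be used for prediction: given the previous history, the predicted next symbol is the one whose addition makes the compressed output grow the least.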
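Both directions can be illustrated with a minimal sketch. It is not a real arithmetic coder: it only computes the ideal code length (the sum of -log2(p) over the symbols) that an arithmetic coder would approach. The function names, the two-letter alphabet, and the simple adaptive context model with add-one smoothing are assumptions made for illustration and do not come from the referenced article.

```python
import math


def next_symbol_probs(history, alphabet, order=2):
    """Predict P(next symbol | last `order` symbols) from counts seen so far.

    Illustrative adaptive context model with add-one smoothing (an assumption,
    not a method taken from the reference).
    """
    context = history[-order:] if order else ""
    counts = {s: 1 for s in alphabet}  # add-one smoothing: no symbol gets p = 0
    for i in range(order, len(history)):
        # Count what followed earlier occurrences of the current context.
        if history[i - order:i] == context and history[i] in counts:
            counts[history[i]] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def code_length_bits(text, alphabet, order=2):
    """Prediction -> compression: ideal code length of `text` in bits.

    Each symbol costs -log2(p), where p is the probability the model gave it
    before seeing it. An arithmetic coder driven by the same model would
    produce output close to this length.
    """
    bits = 0.0
    for i, symbol in enumerate(text):
        p = next_symbol_probs(text[:i], alphabet, order)[symbol]
        bits += -math.log2(p)
    return bits


def predict_next(history, alphabet, order=2):
    """Compression -> prediction: pick the symbol that compresses best.

    Tries every candidate and keeps the one that gives the shortest total
    code length for history + candidate.
    """
    return min(alphabet, key=lambda s: code_length_bits(history + s, alphabet, order))


if __name__ == "__main__":
    alphabet = "ab"       # hypothetical two-symbol alphabet
    text = "ab" * 20      # highly regular, so it should compress well
    print("uniform cost (bits):", len(text) * math.log2(len(alphabet)))
    print("model cost (bits):  ", round(code_length_bits(text, alphabet), 2))
    print("predicted next:     ", predict_next(text, alphabet))
```

Because the code length of history + candidate equals the code length of the history plus -log2(p) for the candidate, choosing the candidate that compresses best is the same as choosing the most probable symbol under the model. That is the equivalence the paragraph describes.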
Reference
https://en.wikipedia.org/wiki/Data_compression#Machine_learning