Why Serialization? Two factors
1. Less memory utilization in the serialized format of data in memory or at disk and
2. Faster serialization / de-serialization
See benchmark analysis of the jvm based serelizers https://github.com/eishay/jvm-serializers/wiki.
Kryo Serializer is one of top performers in the benchmark analysis.
Twitter Chill is a Kryo Serializer for Scala that Spark uses.
Impact of Serialization
- Saw as much as 4× space reduction and 10× time reduction with Kryo
- Simple way to test serialization cost in your program: profile it with jstack or hprof
In Spark 2.0, Dataset API (which is a convergence of api features of Dataframe and RDD) uses its own serializer based on Tungsten. Tungsten gives better compaction of data and runs faster than Kryo. RDD will continue to use Kryo for disk spills, network shuffle and serialized persistence.