The unit of data within Kafka is called a message. It can be XML, JSON, a plain string, or any other format. Many Kafka developers favor Apache Avro, a serialization framework originally developed for Hadoop. Kafka itself does not care: it treats every message as an opaque byte array and stores it without inspecting the format.
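Because Kafka only sees bytes, the producer and consumer must agree on a serialization format. A minimal sketch using JSON (the `serialize`/`deserialize` helpers are illustrative, not part of any Kafka client API):

```python
import json

def serialize(value: dict) -> bytes:
    # Kafka stores the value as an opaque byte array; the producer
    # decides how to encode it (JSON here, Avro is also common).
    return json.dumps(value).encode("utf-8")

def deserialize(raw: bytes) -> dict:
    # The consumer must apply the matching decoder.
    return json.loads(raw.decode("utf-8"))

event = {"user": "alice", "action": "login"}
raw = serialize(event)
assert deserialize(raw) == event
```

In practice this encode/decode pair is supplied to the client as the `value_serializer` / `value_deserializer`, so application code works with plain objects.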
A batch is a collection of messages, all of which are being produced to the same topic and partition.
A topic is a named category used to store and publish a particular stream of data. Topics in Kafka are similar to tables in a database, but without the constraints a database table enforces.
For example, consider a topic named “activity-log” that has 3 partitions:
activity-log-0
activity-log-1
activity-log-2
When a source system sends messages to the activity-log topic, each message (1-n) can be stored in any of the partitions, chosen based on the message key, load, and various other factors.
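The partition choice for a keyed message follows a hash-mod pattern. A simplified sketch (Kafka's default partitioner actually uses a murmur2 hash; CRC32 stands in here purely to illustrate the idea):

```python
import zlib

PARTITIONS = ["activity-log-0", "activity-log-1", "activity-log-2"]

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Hash the key and take the remainder so the same key
    # always lands in the same partition.
    return zlib.crc32(key) % num_partitions

p = choose_partition(b"user-42", len(PARTITIONS))
print(PARTITIONS[p])
```

The important property is determinism: all messages with the same key go to the same partition, which preserves their relative order for consumers. Messages without a key are spread across partitions by the producer.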
A single Kafka server is called a broker. The broker receives messages from producer clients, assigns and maintains their offsets, and persists the messages to its storage system.
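Within each partition, the broker assigns every message a sequential offset, which is how consumers track their position. A toy in-memory model of this behavior (not Kafka's actual implementation, which is a disk-backed commit log):

```python
class PartitionLog:
    """Toy model of a single partition: an append-only log in which
    each message receives the next sequential offset."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        # The offset is simply the message's position in the log.
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset: int) -> bytes:
        # Consumers fetch by offset, so re-reading is cheap and repeatable.
        return self._messages[offset]

log = PartitionLog()
assert log.append(b"first") == 0
assert log.append(b"second") == 1
assert log.read(1) == b"second"
```

Because offsets are per-partition, a consumer's position is just one integer per partition, which is what makes resuming after a restart straightforward.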
High Throughput: supports millions of messages on modest hardware
Scalability: the cluster is a highly scalable distributed system that can be expanded with no downtime
Replication: messages are replicated across the cluster to support multiple subscribers and to keep data available to consumers when a broker fails
Durability: messages are persisted to disk
Stream Processing: integrates with real-time streaming frameworks such as Apache Spark and Apache Storm
Data Loss: with proper configuration, Kafka can ensure zero data loss
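As a sketch of the "proper configuration" for zero data loss, these are the settings most commonly combined (the producer settings and the broker/topic setting are shown together here for illustration; they are configured in different places):

```properties
# Producer: wait until all in-sync replicas acknowledge each write
acks=all
# Producer: retry transient failures instead of dropping the message,
# and deduplicate retries on the broker
retries=2147483647
enable.idempotence=true

# Broker/topic: refuse writes unless at least 2 replicas are in sync
min.insync.replicas=2
```

Together, `acks=all` plus `min.insync.replicas=2` guarantee that every acknowledged message exists on at least two brokers before the producer considers it written.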