Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.
Amazon Kinesis Data Streams enables real-time processing of streaming data at massive scale.
Kinesis Streams enables you to build custom applications that process or analyze streaming data for specialized needs.
A Kinesis data stream is a set of shards.
Each shard has a sequence of data records.
Each data record has a sequence number assigned by Kinesis Data Streams.
Producers use the PUT call to store data in a stream (each record <= 1 MB).
The producer supplies a partition key, which is used to distribute the PUTs across shards.
You can have multiple Amazon Kinesis consumers operating on the stream at the same time—independently.
Amazon Kinesis offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application.
With Amazon Kinesis, you can ingest real-time data such as application logs, website clickstreams, IoT telemetry data, and more into your databases, data lakes, and data warehouses, or build your own real-time applications using this data.
Amazon Kinesis enables you to process and analyze data as it arrives and respond in real-time instead of having to wait until all your data is collected before the processing can begin.
You can create data-processing applications, known as Amazon Kinesis Streams applications. A typical Amazon Kinesis Streams application reads data from an Amazon Kinesis stream as data records. These applications can use the Amazon Kinesis Client Library, and they can run on Amazon EC2 instances. The processed records can be sent to dashboards, used to generate alerts, dynamically change pricing and advertising strategies, or send data to a variety of other AWS services.
Kinesis Streams features
Handles the provisioning, deployment, and ongoing maintenance of hardware, software, and other services for the data streams.
Manages the infrastructure, storage, networking, and configuration needed to stream the data at the level of required data throughput.
Synchronously replicates data across three facilities in an AWS Region, providing high availability and data durability.
Data is available for 24 hours by default and up to 8760 hours (365 days) to be read, re-read, analyzed, or moved to long-term storage (for example Amazon S3 or Amazon Redshift)
Data such as clickstreams, application logs, and social media feeds can be added from multiple sources and, within seconds, is available to Kinesis applications for processing.
Kinesis provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple applications.
Amazon Kinesis is designed to process streaming big data, and the pricing model accommodates a heavy PUT rate.
Ordering is guaranteed at the shard level, but not across the entire stream.
Multiple Kinesis Data Streams applications can consume data from a stream, so that multiple actions, like archiving and processing, can take place concurrently and independently
Kinesis Streams is useful for rapidly moving data off data producers and then continuously processing the data, be it to transform the data before emitting to a data store, run real-time metrics and analytics, or derive more complex data streams for further processing.
Accelerated log and data feed intake: Data producers can push data to a Kinesis stream as soon as it is produced, preventing data loss and making it available for processing within seconds.
Real-time metrics and reporting: Metrics can be extracted and used to generate reports from data in real-time.
Real-time data analytics: Run real-time streaming data analytics.
Complex stream processing: Create Directed Acyclic Graphs (DAGs) of Kinesis Applications and data streams, with Kinesis applications adding to another Amazon Kinesis stream for further processing, enabling successive stages of stream processing.
Kinesis limits
Stores records of a stream for up to 24 hours by default, which can be extended to a maximum of 8760 hours (365 days).
Maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB).
Each shard can support up to 1000 PUT records per second.
S3 is a cost-effective way to store the data, but not designed to handle a stream of data in real-time.
Data Record
A record is the unit of data stored in an Amazon Kinesis data stream.
A record is composed of a sequence number, partition key, and data blob, which is an immutable sequence of bytes.
Maximum size of a data blob is 1 MB.
Partition key
Partition key is used to segregate and route records to different shards of a stream.
A partition key is specified by the data producer while adding data to an Amazon Kinesis stream.
Sequence number
A sequence number is a unique identifier for each record.
Kinesis assigns a sequence number when a data producer calls the PutRecord or PutRecords operation to add data to a stream.
Sequence numbers for the same partition key generally increase over time; the longer the time period between PutRecord or PutRecords requests, the larger the sequence numbers become.
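For illustration, a minimal boto3 sketch of adding a record and reading back the assigned sequence number (the stream name, payload, and partition key are assumptions for the example):

```python
import boto3

kinesis = boto3.client("kinesis")

# Put a single record; the partition key determines which shard stores it.
response = kinesis.put_record(
    StreamName="example-stream",          # assumed stream name
    Data=b'{"event": "page_view"}',       # data blob, max 1 MB
    PartitionKey="user-42",
)

# Kinesis Data Streams assigns the sequence number and reports the target shard.
print(response["ShardId"], response["SequenceNumber"])
```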
Data Stream
A data stream represents a group of data records.
Data records in a data stream are distributed into shards.
Shard
Each shard has a sequence of data records.
A stream is made up of shards; the shard is the base throughput unit of a Kinesis stream, and pricing is on a per-shard basis.
Each shard can support:
up to five transactions per second for reads
up to a maximum total data read rate of 2 MB per second
up to 1,000 records per second for writes
up to a maximum total data write rate of 1 MB per second (including partition keys)
Each shard provides a fixed unit of capacity. If the limits are exceeded, either by data throughput or the number of PUT records, the put data call will be rejected with a ProvisionedThroughputExceeded exception.
This can be handled by
Implementing a retry on the data producer side, if this is due to a temporary rise in the stream’s input data rate
Dynamically scaling the number of shards (resharding) to provide enough capacity for the put data calls to consistently succeed.
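A minimal sketch of handling the exception on the producer side, assuming boto3 and a hypothetical stream name; the commented update_shard_count call illustrates resharding:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_retry(stream_name, data, partition_key, max_attempts=5):
    """Retry with exponential backoff when the shard's write limits are exceeded."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName=stream_name, Data=data, PartitionKey=partition_key
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep((2 ** attempt) * 0.1)   # back off before retrying

    # If throttling persists, reshard to add capacity, for example:
    # kinesis.update_shard_count(StreamName=stream_name, TargetShardCount=4,
    #                            ScalingType="UNIFORM_SCALING")
    raise RuntimeError("put_record kept exceeding provisioned throughput")
```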
Retention Period
All data is stored for 24 hours by default, and the retention period can be increased to a maximum of 8760 hours (365 days).
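The retention period can be changed through the API; a minimal boto3 sketch (stream name and retention value are assumptions):

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend retention beyond the 24-hour default (maximum 8760 hours / 365 days).
kinesis.increase_stream_retention_period(
    StreamName="example-stream",   # assumed stream name
    RetentionPeriodHours=168,      # e.g. keep records for 7 days
)
```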
Producers
A producer puts data records into Kinesis data streams.
To put data into the stream, specify the name of the stream, a partition key, and the data blob to be added to the stream.
Partition key is used to determine which shard in the stream the data record is added to.
Consumers
A consumer is an application built to read and process data records from Kinesis data streams.
The KCL requires an application name that is unique across your applications and across Amazon DynamoDB tables in the same Region. It uses the application name configuration value in the following ways:
• All workers associated with this application name are assumed to be working together on the same stream. These workers may be distributed on multiple instances. If you run an additional instance of the same application code, but with a different application name, the KCL treats the second instance as an entirely separate application that is also operating on the same stream.
• The KCL creates a DynamoDB table with the application name and uses the table to maintain state information (such as checkpoints and worker-shard mapping) for the application. Each application has its own DynamoDB table
Worker: An Amazon Kinesis Application can have multiple application instances and a worker is the processing unit that maps to each application instance.
Record Processor: It is the processing unit that processes data from a shard of an Amazon Kinesis data stream. One worker maps to one or more record processors. One record processor maps to one shard and processes records from that shard.
Kinesis Security
Supports server-side encryption using Key Management Service (KMS) for encrypting the data at rest (see the sketch after this list).
Supports writing encrypted data to a data stream by encrypting and decrypting on the client side.
Supports interface VPC endpoint to keep traffic between VPC and Kinesis Data Streams from leaving the Amazon network. Interface VPC endpoints don’t require an IGW, NAT device, VPN connection, or Direct Connect.
Integrated with IAM to control access to Kinesis Data Streams resources.
Integrated with CloudTrail, which provides a record of actions taken by a user, role, or an AWS service in Kinesis Data Streams.
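As a sketch of the server-side encryption option mentioned above (the stream name is an assumption; the AWS managed Kinesis key is used here, though a customer managed KMS key could be supplied instead):

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption at rest using KMS.
kinesis.start_stream_encryption(
    StreamName="example-stream",    # assumed stream name
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",      # AWS managed key for Kinesis
)
```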
Kinesis Producer
Data to Kinesis Data Streams can be added via API/SDK (PutRecord and PutRecords) operations, Kinesis Producer Library (KPL), or Kinesis Agent.
API
PutRecord and PutRecords are synchronous operations; each HTTP request sends a single record or multiple records, respectively, to the stream.
Use PutRecords to achieve higher throughput per data producer.
Helps manage many aspects of Kinesis Data Streams (including creating streams, resharding, and putting and getting records)
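A minimal boto3 sketch of a batched PutRecords call (the stream name and payloads are assumptions):

```python
import boto3

kinesis = boto3.client("kinesis")

# PutRecords sends up to 500 records per request; failures are reported per record
# in the response rather than failing the whole call.
response = kinesis.put_records(
    StreamName="example-stream",   # assumed stream name
    Records=[
        {"Data": b'{"event": "click"}', "PartitionKey": "user-1"},
        {"Data": b'{"event": "view"}',  "PartitionKey": "user-2"},
    ],
)
print(response["FailedRecordCount"])
```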
Amazon Kinesis Agent
is a pre-built Java application that offers an easy way to collect and send data to an Amazon Kinesis stream.
Can be installed in Linux-based server environments such as web servers, log servers, and database servers.
Can be configured to monitor certain files on the disk and then continuously send new data to the Amazon Kinesis stream.
Amazon Kinesis Producer Library (KPL)
Is an easy-to-use and highly configurable library that helps put data into an Amazon Kinesis stream.
Provides a layer of abstraction specifically for ingesting data.
Presents a simple, asynchronous, and reliable interface that helps achieve high producer throughput with minimal client resources.
Batches messages, as it aggregates records to increase payload size and improve throughput.
Collects records and uses PutRecords to write multiple records to multiple shards per request.
Writes to one or more Kinesis data streams with an automatic and configurable retry mechanism.
Integrates seamlessly with the Kinesis Client Library (KCL) to de-aggregate batched records on the consumer side.
Submits CloudWatch metrics to provide visibility into performance
Third Party and Open source
Log4j appender
Apache Kafka
Flume, fluentd, etc.
Kinesis Consumers
A Kinesis application is a data consumer that reads and processes data from a Kinesis data stream and can be built using either the Amazon Kinesis API or the Amazon Kinesis Client Library (KCL).
Shards in a stream provide 2 MB/sec of read throughput per shard, by default, which is shared by all the consumers reading from a given shard.
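For illustration, a minimal consumer using the low-level API (as opposed to the KCL); the stream name is an assumption, and a real consumer would iterate over all shards and respect the per-shard read limits:

```python
import boto3

kinesis = boto3.client("kinesis")
stream = "example-stream"                      # assumed stream name

# Read from the first shard, starting at the oldest available record.
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = batch["NextShardIterator"]      # continue from where we left off
```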
Amazon Kinesis Client Library (KCL)
Is a pre-built library with multiple language support
Delivers all records for a given partition key to the same record processor
Makes it easier to build multiple applications reading from the same stream, e.g. to perform counting, aggregation, and filtering.
Handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance.
Uses a unique DynamoDB table to keep track of the application’s state, so if the Kinesis Data Streams application receives provisioned-throughput exceptions, increase the provisioned throughput for the DynamoDB table
Amazon Kinesis Connector Library
Is a pre-built library that helps you easily integrate Amazon Kinesis Streams with other AWS services and third-party tools.
The Kinesis Client Library is required in order to use the Kinesis Connector Library.
It can be replaced by Lambda.
Amazon Kinesis Storm Spout is a pre-built library that helps you easily integrate Amazon Kinesis Streams with Apache Storm.
AWS Lambda, Kinesis Data Firehose, Kinesis Data Analytics also act as consumers for Kinesis Data Streams.
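As a sketch of Lambda acting as a consumer, a handler for a Kinesis event source mapping (the JSON payload format is an assumption):

```python
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)             # assumes the producer wrote JSON
        print(record["kinesis"]["partitionKey"], data)
```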
Kinesis Enhanced fan-out
Provides each registered Kinesis Data Streams consumer with a dedicated, logical 2 MB/sec throughput pipe per shard.
Allows customers to scale the number of consumers reading from a data stream in parallel, while maintaining high performance and without contending for read throughput with other consumers
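A minimal boto3 sketch of registering an enhanced fan-out consumer (the stream ARN and consumer name are assumptions); once ACTIVE, records are pushed to it via the SubscribeToShard API rather than pulled with GetRecords:

```python
import boto3

kinesis = boto3.client("kinesis")

# Register a consumer that gets its own dedicated 2 MB/sec pipe per shard.
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    ConsumerName="example-efo-consumer",
)["Consumer"]

# Check registration status (CREATING -> ACTIVE).
status = kinesis.describe_stream_consumer(
    ConsumerARN=consumer["ConsumerARN"]
)["ConsumerDescription"]["ConsumerStatus"]
print(consumer["ConsumerARN"], status)
```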
Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data.
Kinesis Data Firehose is a fully managed service that automatically scales to match the throughput of the data and requires no ongoing administration and no need to write applications or manage resources.
Data transfer solution for delivering real-time streaming data to destinations such as S3, Redshift, Elasticsearch Service, and Splunk.
Is NOT Real Time, but Near Real Time as it supports batching and buffers streaming data to a certain size (Buffer Size in MBs) or for a certain period of time (Buffer Interval in seconds) before delivering it to destinations.
Supports batching, compression, and encryption of the data before loading it, minimizing the amount of storage used at the destination and increasing security.
Supports data compression, minimizing the amount of storage used at the destination. It currently supports GZIP, ZIP, and SNAPPY compression formats. Only GZIP is supported if the data is further loaded to Redshift.
Supports data at rest encryption using KMS after the data is delivered to the S3 bucket.
Supports multiple producers as data sources, including a Kinesis data stream, the Kinesis Agent, or the Kinesis Data Firehose API using the AWS SDK, as well as CloudWatch Logs, CloudWatch Events, and AWS IoT.
Supports out-of-the-box data transformation as well as custom transformation using a Lambda function to transform incoming source data and deliver the transformed data to destinations (a minimal transform sketch follows this list).
Supports source record backup with custom data transformation with Lambda, where Kinesis Data Firehose will deliver the un-transformed incoming data to a separate S3 bucket.
Uses at least once semantics for data delivery. In rare circumstances such as request timeout upon data delivery attempt, delivery retry by Firehose could introduce duplicates if the previous request eventually goes through.
Supports Interface VPC endpoint (AWS Private Link) to keep traffic between the VPC and Kinesis Data Firehose from leaving the Amazon network. Interface VPC endpoints don’t require an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection
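A minimal sketch of the custom-transformation Lambda mentioned above; it follows the Firehose transformation contract (base64-encoded records in; records with recordId, result, and data out), and the uppercase transform is just a placeholder:

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper() + "\n"       # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                         # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```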
Kinesis Data Firehose delivery stream
Underlying entity of Kinesis Data Firehose, where the data is sent.
Record
Data sent by data producer to a Kinesis Data Firehose delivery stream.
Maximum size of a record (before Base64-encoding) is 1024 KB.
Data producer
Producers send records to Kinesis Data Firehose delivery streams.
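For illustration, a minimal boto3 producer for a Firehose delivery stream (the delivery stream name and payload are assumptions):

```python
import json
import boto3

firehose = boto3.client("firehose")

record = {"user_id": 42, "event": "click"}           # example payload
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",    # assumed delivery stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```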
Buffer size and buffer interval
Kinesis Data Firehose buffers incoming streaming data to a certain size or for a certain time period before delivering it to destinations.
Buffer size and buffer interval can be configured while creating the delivery stream.
Buffer size is in MBs and ranges from 1 MB to 128 MB for the S3 destination and 1 MB to 100 MB for the Elasticsearch Service destination.
Buffer interval is in seconds and ranges from 60 seconds to 900 seconds.
If data delivery to the destination falls behind data writes to the delivery stream, Firehose raises the buffer size dynamically to catch up and ensure that all data is delivered.
Buffer size is applied before compression
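A sketch of setting the buffering hints when creating an S3 delivery stream (the names, ARNs, and buffer values are assumptions):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        # Deliver whenever 5 MB accumulate or 300 seconds elapse, whichever comes first.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```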
Destination
A destination is the data store where the data will be delivered.
Supports S3, Redshift, Elasticsearch, and Splunk as destinations.
Amazon Kinesis Video Streams is a fully managed AWS service that you can use to stream live video from devices to the AWS Cloud, or build applications for real-time video processing or batch-oriented video analytics.
Kinesis Video Streams isn't just storage for video data. You can use it to watch your video streams in real time as they are received in the cloud.
You can either monitor your live streams in the AWS Management Console, or develop your own monitoring application that uses the Kinesis Video Streams API library to display live video.
You can use Kinesis Video Streams to capture massive amounts of live video data from millions of sources, including smartphones, security cameras, webcams, cameras embedded in cars, drones, and other sources.
You can also send non-video time-serialized data such as audio data, thermal imagery, depth data, RADAR data, and more.
As live video streams from these sources into a Kinesis video stream, you can build applications that can access the data, frame-by-frame, in real time for low-latency processing.
Kinesis Video Streams is source-agnostic; you can stream video from a computer's webcam using the GStreamer library, or from a camera on your network using RTSP.
You can also configure your Kinesis video stream to durably store media data for the specified retention period.
Kinesis Video Streams automatically stores this data and encrypts it at rest.
Additionally, Kinesis Video Streams time-indexes stored data based on both the producer time stamps and ingestion time stamps.
You can build applications that periodically batch-process the video data, or you can create applications that require ad hoc access to historical data for different use cases.
Your custom applications, real-time or batch-oriented, can run on Amazon EC2 instances.
These applications might process data using open source deep-learning algorithms, or use third-party applications that integrate with Kinesis Video Streams.
Benefits of using Kinesis Video Streams include the following:
Connect and stream from millions of devices
Durably store, encrypt, and index data
Focus on managing applications instead of infrastructure
Stream data more securely
Pay as you go
HTTP Live Streaming (HLS) can be used to play back Kinesis video streams.
The client can use HLS for live playback: call the GetHLSStreamingSessionURL API to retrieve an HLS streaming session URL, then provide the URL to the video player.
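A minimal boto3 sketch of retrieving a live HLS playback URL (the stream name is an assumption); note that the archived-media client must use the endpoint returned by GetDataEndpoint:

```python
import boto3

kv = boto3.client("kinesisvideo")

# Look up the endpoint that serves HLS sessions for this video stream.
endpoint = kv.get_data_endpoint(
    StreamName="example-video-stream",               # assumed stream name
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

kvam = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)
url = kvam.get_hls_streaming_session_url(
    StreamName="example-video-stream",
    PlaybackMode="LIVE",
)["HLSStreamingSessionURL"]
print(url)   # hand this URL to an HLS-capable video player
```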
Kinesis Analytics now gives you the option to preprocess your data with AWS Lambda.
This gives you a great deal of flexibility in defining what data gets analyzed by your Kinesis Analytics application. You can also define how that data is structured before it is queried by your SQL.
Preprocess data before starting your analysis.
AWS Elemental MediaLive is a broadcast-grade live video encoding service. It lets you create high-quality video streams for delivery to broadcast televisions and internet-connected multiscreen devices, like connected TVs, tablets, smart phones, and set-top boxes. The service functions independently or as part of AWS Media Services.
Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for real-time and batch-driven machine learning (ML), video playback, analytics, and other processing. It enables customers to build machine-vision based applications that power smart homes, smart cities, industrial automation, security monitoring, and more.
Kinesis Data Streams – Kinesis Data Streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs. However, it requires manual scaling and provisioning. Data is typically available in a stream for 24 hours, but for an additional cost, retention can be extended up to 365 days.
Kinesis Data Firehose – Firehose handles loading data streams directly into AWS products for processing. Scaling is handled automatically, up to gigabytes per second, and allows for batching, encrypting, and compressing. Firehose also allows for streaming to S3, Elasticsearch Service, or Redshift, where data can be copied for processing through additional services.