NoSQL, fast & predictable performance, seamless scaling, encryption at rest for sensitive information, on-demand backup, automatic deletion of expired items to save cost (TTL), automatic spreading of data for HA and durability, Global Tables to sync across regions
Table - schemaless, holds a collection of
Items - mandatory primary key (see below) & optional (0 or more) secondary indexes for querying; each item has
Attributes - scalar, or nested (up to 32 levels deep)
PK: must be unique; two options: partition key only | partition key + sort key; the partition key (hash attribute) determines where the item is stored; with partition + sort key (range attribute), items with the same partition key are stored together but must have different sort keys, and are sorted by sort key
Secondary indexes - one table can have 0 or more; two kinds: global or local; global can have an entirely different partition+sort key than the table's, while local has the same partition key & a different sort key; up to (5 global + 5 local) secondary indexes per table; a secondary index is like a view of the table, letting you query it from a different angle, and you can specify which attributes from the base table are copied (projected) into the index; without an index there is no alternative query pattern; indexes affect write performance (each index must be updated) and also take space; an index is not used automatically (unlike SQL) - only a Query/Scan directed at the index uses it; global indexes support only eventual consistency, local indexes can also be read with strong consistency
DB Streams to capture data-modification events on tables - in near real time, ordered, every event a record, 24-hour lifetime, serves as an event source for Lambda, and for replication, materialized views, data analysis, etc.
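A minimal sketch of a Lambda handler wired to a DynamoDB stream (handler and attribute access are illustrative, not from the source):

# Each invocation receives a batch of stream records.
def handler(event, context):
    for record in event["Records"]:            # one record per data-modification event
        name = record["eventName"]             # INSERT | MODIFY | REMOVE
        keys = record["dynamodb"]["Keys"]      # primary key of the modified item
        if name == "INSERT":
            print("new item:", record["dynamodb"].get("NewImage"))
        elif name == "REMOVE":
            print("deleted item:", keys)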
Limits: see developer guide
API: control plane - table-level operations: create/describe/list/update/delete tables; data plane - CRUD; streams - list, describe, get iterator, get records
When creating a table / secondary index, the primary key attributes (partition key, sort key) must be given a type - string, number, or binary; no other attributes need a type declared.
Scalar types:
Document types (can nest)
Set types - elements must be of the same type, and the set must not be empty
Write - if HTTP 200 is received, the data is written and durable.
Read - eventually consistent read (usually consistent within 1 s after a write, across all storage); strongly consistent read - guaranteed most recent data, but may not be available in case of network failure
Pricing: storage (25 GB free) + read capacity units (RCU) + write capacity units (WCU)
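Back-of-envelope sizing sketch, assuming the standard unit definitions (1 RCU = one strongly consistent read/s of an item up to 4 KB, halved for eventually consistent reads; 1 WCU = one write/s of an item up to 1 KB); the item size and rates below are made up:

import math

item_kb, reads_per_s, writes_per_s = 6, 100, 20
rcu_strong = math.ceil(item_kb / 4) * reads_per_s   # 2 * 100 = 200 RCU
rcu_eventual = rcu_strong / 2                       # 100 RCU for eventually consistent reads
wcu = math.ceil(item_kb / 1) * writes_per_s         # 6 * 20 = 120 WCU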
Scaling: auto scaling (min, max, target utilization percentage) OR manually provisioned throughput
Partition - handled entirely by AWS, you never manage partitions. The partition key determines the partition - best practice: choose a partition key that spreads activity evenly
Idempotent - an idempotent operation can be performed many times with the same effect; when the result is uncertain (e.g. network error), an idempotent operation is safe to retry, otherwise a check is needed before retrying
Access - via an HTTPS web service, stateless, each request carries a signature, authorized by IAM; NOTE: In December 2017, AWS began the process of migrating all DynamoDB endpoints to use secure certificates issued by Amazon Trust Services (ATS).
Access: authentication - root user, IAM user, IAM role (for federated users, AWS service users, applications on EC2); conditions can achieve item-level / attribute-level control; worth further study - this can allow clients to interact with DynamoDB directly, without an intermediate Lambda (example)
CreateTable - define name, key schema, attribute definitions (types for key attributes), throughput settings, as a JSON document
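A CreateTable sketch with the Python SDK (boto3); table and attribute names are illustrative:

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Music",
    KeySchema=[
        {"AttributeName": "Artist", "KeyType": "HASH"},      # partition key
        {"AttributeName": "SongTitle", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[  # types declared only for key attributes
        {"AttributeName": "Artist", "AttributeType": "S"},
        {"AttributeName": "SongTitle", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)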
DescribeTable - call with table name
PutItem - with table name and item (native support for JSON); if 200 is returned it's done (no commit); may request ReturnValues ALL_OLD
GetItem - most efficient way to retrieve an item; must provide the full pk; eventually consistent read by default, pass the ConsistentRead parameter to request a strongly consistent read; can use a ProjectionExpression to return a subset of attributes; batch get/write can reduce network round trips - they are wrappers around individual requests
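GetItem sketch (boto3; table, keys and attributes are illustrative):

import boto3

table = boto3.resource("dynamodb").Table("Music")

resp = table.get_item(
    Key={"Artist": "No One You Know", "SongTitle": "Call Me Today"},  # full PK required
    ConsistentRead=True,                        # request a strongly consistent read
    ProjectionExpression="Artist, AlbumTitle",  # return a subset of attributes
)
item = resp.get("Item")  # key absent if no matching item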
Query (on a table with a composite PK, i.e. partition key + sort key: provide the exact partition key & an optional comparison condition on the sort key)
Gets all items with a particular partition key, optionally narrowed by the sort key; also quick; no SQL JOIN available, one table only; must provide the partition key; use parameter binding (:par1); use KeyConditionExpression (see below, very limited) to supply the partition key with an equality condition and optionally the sort key with a comparison condition; optional FilterExpression for non-key conditions; when querying a secondary index, strong consistency can be requested, but global indexes support only eventual consistency; the result set is always sorted by sort key; eventual consistency by default, strong consistency on request; see the sketch after the expression notes below.
Limit - maximum number of returned results, applied before filtering, so may return fewer than actually available; Pagination - a result page is <= 1 MB, otherwise the response carries LastEvaluatedKey pointing at the remaining results; issue another query with exactly the same conditions & ExclusiveStartKey to continue retrieval; --page-size (AWS CLI, not in the low-level API) can limit the number of items in a page; may return an empty set with a LastEvaluatedKey if all items read were filtered out; SDKs may provide advanced pagination through an API abstraction.
Also returns ScannedCount (items matching the key condition, before filtering) & Count (returned items matching both key condition & filter), for the current page only; there is no SQL-style count(*).
KeyConditionExpression:
Can use ExclusiveStartKey & ScanIndexForward to control scan starting point & direction for pagination.
FilterExpression:
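Putting Query, KeyConditionExpression, FilterExpression & pagination together - a boto3 sketch, names illustrative:

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Music")

kwargs = dict(
    KeyConditionExpression=Key("Artist").eq("No One You Know")
                           & Key("SongTitle").begins_with("Call"),  # eq on PK, comparison on SK
    FilterExpression=Attr("Year").gte(2000),  # non-key condition, applied after the read
    ScanIndexForward=False,                   # descending sort-key order
)
items = []
while True:
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    if "LastEvaluatedKey" not in resp:        # no more pages
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]  # same conditions, next page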
Scan - gets all items in a table; resource-hungry for large tables, so use it sparingly, on small tables, or only when you have no alternative; FilterExpression to filter (if the table is large you are still charged for the full scan even if few results are returned), ProjectionExpression to limit returned attributes; when scanning a secondary index, strong consistency can be requested, but global indexes support only eventual consistency; further see guide
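Scan sketch using the SDK's pagination abstraction (a boto3 paginator); #y is a name placeholder because YEAR is a DynamoDB reserved word; names illustrative:

import boto3

client = boto3.client("dynamodb")
paginator = client.get_paginator("scan")  # hides the LastEvaluatedKey loop
for page in paginator.paginate(
    TableName="Music",
    FilterExpression="#y >= :y",
    ExpressionAttributeNames={"#y": "Year"},
    ExpressionAttributeValues={":y": {"N": "2000"}},  # low-level API: data type descriptors
):
    for item in page["Items"]:
        print(item)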
UpdateTable with GlobalSecondaryIndexUpdates / Create...; see guide; provide index name, key schema, (attribute definitions: add to the table any attribute used as an index key), projection, throughput settings - quite like creating a table; the index is not ready until the Backfilling attribute (in DescribeTable) turns false
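Sketch of adding a global secondary index to an existing table (boto3; names illustrative):

import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="Music",
    AttributeDefinitions=[  # declare the new key attribute on the table
        {"AttributeName": "Genre", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "GenreIndex",
            "KeySchema": [{"AttributeName": "Genre", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "KEYS_ONLY"},  # which attributes to copy
            "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
        }
    }],
)
# Poll DescribeTable until Backfilling turns false / IndexStatus shows the index is ready.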
(actually "upsert", update or insert, no batch operation) - provide table name, keys, UpdateExpression (like SQL set with parameter placeholder) to update subset of attributes, optional ConditionExpression to update on condition (strong consistency, may use as optimistic lock, conditional update is idempotent if on the to-be-updated attribute, consume write capacity only, ConditionalCheckFailedException if condition does not met); ExpressionAttributes (bind parameter to values); ReturnValues specify what to return from the operation ALL_OLD entire item before update ALL_NEW entire item after update UPDATED_OLD UPDATED_NEW only return old/new attributes having been updated;
Atomic counters - a number attribute can be increased/decreased unconditionally during UpdateItem, similar to a sequence in SQL; retrying a failed operation risks applying the update twice - fine for relaxed usage such as a visitor counter but not for financial transactions (use a conditional update in such circumstances)
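Atomic counter sketch (boto3; table/attribute names illustrative):

import boto3

table = boto3.resource("dynamodb").Table("PageViews")

# ADD increments atomically and unconditionally (creates the attribute if absent);
# fine for a visitor counter, not for money.
table.update_item(
    Key={"PageId": "home"},
    UpdateExpression="ADD ViewCount :inc",
    ExpressionAttributeValues={":inc": 1},
)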
DeleteItem (BatchWriteItem can batch deletes) - provide table name and pk; optional ConditionExpression to delete on a condition; may ReturnValues ALL_OLD
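Batch sketch via boto3's batch_writer, which wraps BatchWriteItem and automatically resends unprocessed items; note that batch operations accept no ConditionExpression (names illustrative):

import boto3

table = boto3.resource("dynamodb").Table("Music")

with table.batch_writer() as batch:  # buffers and flushes in chunks of 25
    batch.delete_item(Key={"Artist": "Acme", "SongTitle": "Hello"})
    batch.put_item(Item={"Artist": "Acme", "SongTitle": "World", "Released": 2001})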
ConditionExpression:
condition-expression ::=
operand comparator operand
| operand BETWEEN operand AND operand
| operand IN ( operand (',' operand (, ...) ))
| function
| condition AND condition
| condition OR condition
| NOT condition
| ( condition )
comparator ::=
=
| <>
| <
| <=
| >
| >=
function ::=
attribute_exists (path)
| attribute_not_exists (path)
| attribute_type (path, type)
| begins_with (path, substr)
| contains (path, operand)
| size (path)
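For example, attribute_not_exists on a key attribute makes PutItem insert-only (boto3 sketch, names illustrative):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Music")

try:
    table.put_item(
        Item={"Artist": "Acme", "SongTitle": "Hello"},
        ConditionExpression="attribute_not_exists(Artist)",  # fail if the key already exists
    )
except ClientError as e:
    if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
        raise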
Auto-deletes expired items. Enabled on the table; the expiry time is set per item.
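TTL sketch (boto3); the table and the ExpiresAt attribute are illustrative:

import time
import boto3

client = boto3.client("dynamodb")

# Enable TTL on the table, naming the attribute that carries the expiry time.
client.update_time_to_live(
    TableName="Sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ExpiresAt"},
)

# Per item: store the expiry as epoch seconds; DynamoDB deletes the item after that time.
boto3.resource("dynamodb").Table("Sessions").put_item(
    Item={"SessionId": "abc123", "ExpiresAt": int(time.time()) + 3600}
)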
Local - as a .jar, also via Maven, or with the AWS Toolkit for Eclipse; now comes with a free and very helpful web-based user interface known as the DynamoDB JavaScript Shell
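Pointing the SDK at DynamoDB Local instead of AWS is just an endpoint override (boto3 sketch):

import boto3

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # DynamoDB Local's default port
    region_name="us-east-1",               # any region/credentials work locally
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)
print(list(dynamodb.tables.all()))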
AWS - see guide
Access - use console, CLI, API
Getting started - see examples in various languages; at first impression the JavaScript integration feels the most natural
SDKs provide a low-level interface, a document interface & a high-level (object-persistence) interface, depending on the language
SDKs (1) format the request (2) sign the request (3) send the request / receive the response (4) extract the result (5) apply basic retry logic on error
Low-level interface: available in every language; methods resemble DynamoDB operations; construct request, send, etc.; uses Data Type Descriptors to specify data types
Document interfaces: perform data-plane operations; data types are implied (no Data Type Descriptors); convert JSON <-> native DynamoDB data types; available in Java, JavaScript & Node.js ... but not all languages; provide a document wrapper
Object-persistence interface: does not perform operations on the DB directly but works on objects that represent items in the DB; available in Java & .NET; in Java uses annotations similar to JPA (@DynamoDBTable), with DynamoDBMapper as a wrapper of the low-level client
Low-level API: the on-wire HTTPS protocol
Error handling: DynamoDB returns an HTTP code (e.g. 400), an exception name (e.g. ResourceNotFoundException) & an error message; SDKs take care of propagating errors in each language so the programmer can try/catch; some errors are OK to retry (server errors, 5xx) while others are not (4xx); SDKs do their own retries; quote the Request ID from the response if you need support; retries can be configured with ClientConfiguration (Java); batch operations wrap individual operations - some can fail while others succeed, and the response returns the individual requests that failed; the most likely failure is throttling, which can be retried, but an "exponential backoff strategy" is strongly recommended - be nice to the server :)
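Exponential-backoff sketch layered on top of the SDK's built-in retries (boto3; the retryable-code set and delays are illustrative):

import random
import time
import boto3
from botocore.exceptions import ClientError

client = boto3.client("dynamodb")
RETRYABLE = {"ProvisionedThroughputExceededException", "ThrottlingException", "InternalServerError"}

def call_with_backoff(fn, max_attempts=5, **kwargs):
    for attempt in range(max_attempts):
        try:
            return fn(**kwargs)
        except ClientError as e:
            if e.response["Error"]["Code"] not in RETRYABLE or attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))  # jittered exponential delay

item = call_with_backoff(
    client.get_item,
    TableName="Music",
    Key={"Artist": {"S": "Acme"}, "SongTitle": {"S": "Hello"}},
)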
Durability by default: no SLA, no official figure, but it looks quite durable (data spreading & duplication); there is speculation that it's less durable than S3, so backups are suggested; quote: "Amazon talks about customers backing up DynamoDB to S3 using MapReduce. They also say that some customers back up DynamoDB using Redshift, which has DynamoDB compatibility built in"; non-AWS backup is also recommended
On-demand backup: one-click (or API call) backup & restore, no impact on performance & availability, completes in seconds regardless of size, unlimited number of backups, all copies retained until explicitly deleted; charged by storage used
Restore - takes time proportional to size; restores to a new destination table
Continuous backup & point-in-time recovery - one-click enable; can restore to any second within the last 35 days; no performance penalty, but a restore can take hours to complete; priced by size
Global table - multi-region, multi-master database
Encryption at rest - encrypts data at rest with AWS Key Management Service (KMS); enabled at the table level; DynamoDB must be able to access the key to read the table
In-memory acceleration with DAX - see guide
Tools:
VPC Endpoints for DynamoDB - you can launch Amazon EC2 instances into a virtual private cloud, which is logically isolated from other networks, including the public Internet; with an Amazon VPC you control its IP address range, subnets, routing tables, network gateways, and security settings; a VPC endpoint lets instances in the VPC reach DynamoDB without traversing the public Internet.
With Cognito - use IAM role to generate temporary credentials for authenticated / unauthenticated users
With Redshift - copy data from DynamoDB to Redshift for SQL based analysis
With Apache Hive (data warehouse) on EMR - read and write data in DynamoDB tables, enabling queries over live data in HiveQL (SQL-like); copy data between DynamoDB and S3, and between DynamoDB and the Hadoop Distributed File System (HDFS), in both directions; perform join operations
(quote from https://cloudacademy.com/blog/amazon-dynamodb-ten-things/ ) In a typical scenario, Elastic MapReduce (EMR) performs its complex analysis on datasets stored on DynamoDB. Users will often also use AWS Redshift for data warehousing, where BI tasks are carried out on data loaded from DynamoDB tables to Redshift.