SageMaker is a fully managed service to build, train, and deploy machine learning (ML) models quickly.
SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.
SageMaker is designed for high availability with no maintenance windows or scheduled downtimes.
SageMaker APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS region to provide fault tolerance in the event of a server failure or AZ outage.
SageMaker provides a full end-to-end workflow, but users can continue to use their existing tools with SageMaker.
SageMaker supports Jupyter notebooks.
SageMaker allows users to select the number and type of instances used for hosted notebooks, training, and model hosting.
Generate example data
Involves exploring and preprocessing, or “wrangling,” example data before using it for model training.
To preprocess data, you typically do the following:
Fetch the data
Clean the data
Prepare or transform the data
Train a model
Model training includes both training and evaluating the model, as follows:
Training the model
Needs an algorithm, whose choice depends on a number of factors, such as the problem type and the training data.
Needs compute resources for training.
Evaluating the model
Determine whether the accuracy of the inferences is acceptable.
Training Data Format – File mode vs Pipe mode
Most Amazon SageMaker algorithms work best when using the optimized protobuf recordIO format for the training data.
Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it.
File mode loads all of the data from S3 to the training instance volumes.
In Pipe mode, the training job streams data directly from S3.
Streaming can provide faster start times for training jobs and better throughput.
Pipe mode also lets you reduce the size of the EBS volumes for the training instances, since it needs only enough disk space to store the final model artifacts (see the sketch after this list).
File mode needs disk space to store both the final model artifacts and the full training dataset.
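A minimal sketch of enabling Pipe mode with the SageMaker Python SDK; the Linear Learner algorithm, role ARN, and S3 paths below are placeholder assumptions:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

role = "arn:aws:iam::123456789012:role/SageMakerRole"   # placeholder
image_uri = sagemaker.image_uris.retrieve("linear-learner", region="us-east-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size=5,              # Pipe mode needs space only for model artifacts
    input_mode="Pipe",          # stream training data from S3 instead of copying it
    output_path="s3://my-bucket/output",
)

train_input = TrainingInput(
    "s3://my-bucket/train/",
    content_type="application/x-recordio-protobuf",   # optimized recordIO format
)
estimator.fit({"train": train_input})
```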
Build Model
SageMaker provides several built-in machine learning algorithms that can be used for a variety of problem types.
Write a custom training script in a machine learning framework that SageMaker supports, and use one of the pre-built framework containers to run it in SageMaker (see the sketch after this list).
Bring your own algorithm or model to train or host in SageMaker.
SageMaker provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training and inference.
Containers allow machine learning algorithms to be trained and models to be deployed quickly and reliably at any scale.
Use an algorithm that you subscribe to from AWS Marketplace.
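As a sketch of the custom-script option, a training script can run in a pre-built framework container via a framework estimator; the script name, role, and paths below are assumptions:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",     # your custom training script (hypothetical)
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
estimator.fit({"training": "s3://my-bucket/train/"})
```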
Deploy the model
Re-engineer the model as needed before integrating it with the application and deploying it.
SageMaker supports both "hosting services" and "batch transform".
Hosting services
Provides an HTTPS endpoint where the machine learning model is available to provide inferences.
Supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
Supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload (see the sketch below).
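A boto3 sketch of a canary-style rollout using two production variants behind one endpoint, with 10% of traffic routed to the new model; the model and endpoint names are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "current",
            "ModelName": "model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.9,   # 90% of traffic
        },
        {
            "VariantName": "canary",
            "ModelName": "model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,   # 10% of traffic to the new model
        },
    ],
)
sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config")
```

Automatic scaling for a variant is then configured separately through Application Auto Scaling once the endpoint is in service.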
Batch transform
To get inferences on entire datasets, consider using batch transform as an alternative to hosting services (see the sketch below).
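A minimal batch transform sketch, reusing the estimator from an earlier sketch; the S3 paths and content type are assumptions:

```python
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",          # split the input file into one record per line
)
transformer.wait()
```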
SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.
SageMaker allows using encrypted S3 buckets for model artifacts and data, as well as passing a KMS key to SageMaker notebooks, training jobs, and endpoints to encrypt the attached ML storage volume (see the sketch after this list).
Requests to the SageMaker API and console are made over a secure (SSL) connection.
SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
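A sketch of passing KMS keys so the ML storage volume and output artifacts are encrypted; the key ARNs and other values are placeholders (image_uri and role as in the earlier sketch):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_kms_key="arn:aws:kms:us-east-1:123456789012:key/volume-key-id",  # encrypts the ML storage volume
    output_kms_key="arn:aws:kms:us-east-1:123456789012:key/output-key-id",  # encrypts model artifacts in S3
    output_path="s3://my-encrypted-bucket/output",
)
```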
SageMaker notebooks are collaborative notebooks, built into SageMaker Studio, that can be launched quickly.
Can be accessed without setting up compute instances and file storage.
Users are charged only for the resources consumed while a notebook is running.
Instance types can be easily switched during the experimentation phase if more or less computing power is needed.
BlazingText algorithm
Provides highly optimized implementations of the Word2vec and text classification algorithms.
Word2vec algorithm
Useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc.
Maps words to high-quality distributed vectors; the resulting vector representation of a word is called a word embedding.
Word embeddings capture the semantic relationships between words.
Text classification
It is an important task for applications performing web searches, information retrieval, ranking, and document classification.
Provides the Skip-gram and continuous bag-of-words (CBOW) training architectures.
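A sketch of selecting a BlazingText training architecture through its mode hyperparameter; the role and S3 paths are placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator

image_uri = sagemaker.image_uris.retrieve("blazingtext", region="us-east-1")
bt = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/blazingtext/output",
)
bt.set_hyperparameters(mode="skipgram")   # or "cbow", "batch_skipgram", "supervised"
bt.fit({"train": "s3://my-bucket/blazingtext/train"})
```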
DeepAR forecasting algorithm
It is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
Use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on.
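DeepAR training data is provided in JSON Lines format, one time series per line; the values below are illustrative:

```python
import json

# "start" and "target" are required; "cat" and "dynamic_feat" are optional.
series = {"start": "2024-01-01 00:00:00", "target": [5.0, 7.2, 6.1, 8.4], "cat": [0]}
print(json.dumps(series))   # one line of the training file
```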
Factorization machine
It is a general-purpose supervised learning algorithm used for both classification and regression tasks.
Extension of a linear model designed to capture interactions between features within high dimensional sparse datasets economically.
Image classification algorithm
A supervised learning algorithm that supports multi-label classification.
Takes an image as input and outputs one or more labels.
Uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available.
Recommended input format is Apache MXNet RecordIO. Also supports raw images in .jpg or .png format.
IP Insights
Is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
Designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
K-means algorithm
It is an unsupervised learning algorithm for clustering.
Attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
K-nearest neighbors (k-NN) algorithm
It is an index-based algorithm.
Uses a non-parametric method for classification or regression.
For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label.
For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their feature values as the predicted value.
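A small NumPy illustration of the k-NN prediction rule just described (not the SageMaker implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, task="classification"):
    """Illustrative k-NN prediction: vote for classification, mean for regression."""
    dists = np.linalg.norm(X_train - x, axis=1)       # distance to every training point
    nearest = y_train[np.argsort(dists)[:k]]          # values of the k closest points
    if task == "classification":
        return Counter(nearest).most_common(1)[0][0]  # most frequent label
    return float(np.mean(nearest))                    # average for regression

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
y = np.array([0, 1, 1, 0])
print(knn_predict(X, y, np.array([1.0, 1.0]), k=3))   # -> 1
```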
Latent Dirichlet Allocation (LDA) algorithm
It is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
Is used to discover a user-specified number of topics shared by documents within a text corpus.
Linear Learner
It is a supervised learning algorithm used for solving either classification or regression problems.
Neural Topic Model (NTM) Algorithm
It is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution.
Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.
Object2Vec algorithm
Is a general-purpose neural embedding algorithm that is highly customizable.
Can learn low-dimensional dense embeddings of high-dimensional objects.
Object Detection algorithm
Detects and classifies objects in images using a single deep neural network.
It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
Principal Component Analysis
It is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.
Random Cut Forest (RCF)
It is an unsupervised algorithm for detecting anomalous data points within a data set.
Semantic segmentation algorithm
Provides a fine-grained, pixel-level approach to developing computer vision applications.
SageMaker Sequence to Sequence (seq2seq)
It is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
Key use cases are machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string that is a summary), and speech-to-text (audio clips converted into output sentences in tokens).
XGBoost (eXtreme Gradient Boosting)
It is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models.
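A minimal open-source XGBoost sketch on synthetic data; the hyperparameter values are illustrative only:

```python
import numpy as np
import xgboost as xgb

# Fit an ensemble of boosted trees on a synthetic binary classification task.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))
```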
SageMaker Elastic Inference (EI)
Helps speed up the throughput and decrease the latency of getting real-time inferences from deep learning models deployed as SageMaker hosted models.
Adds inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
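A sketch of attaching an Elastic Inference accelerator at deploy time, assuming model is a SageMaker Model object created earlier; the instance and accelerator types are illustrative:

```python
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    accelerator_type="ml.eia2.medium",   # fractional GPU-style acceleration
)
```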
SageMaker Ground Truth
Provides automated data labeling using machine learning.
Helps build highly accurate training datasets for machine learning quickly.
Offers easy access to labelers through Amazon Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.
Allows using your own labelers or using vendors recommended by Amazon through AWS Marketplace.
Helps lower the labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.
Significantly reduces the time and effort required to create datasets for training, which reduces costs.
Provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple workers’ annotation tasks into one high-fidelity label.
Automated data labeling first selects a random sample of data and sends it to Amazon Mechanical Turk to be labeled.
Results are then used to train a labeling model that attempts to label a new sample of raw data automatically.
Labels are committed when the model can label the data with a confidence score that meets or exceeds a threshold you set.
For confidence scores falling below the defined threshold, the data is sent to human labelers.
Some of the data labeled by humans is used to generate a new training dataset for the labeling model, and the model is automatically retrained to improve its accuracy.
Process repeats with each sample of raw data to be labeled.
Labeling model becomes more capable of automatically labeling raw data with each iteration, and less data is routed to humans.
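The loop above can be summarized in pseudocode; every function name here is hypothetical, not a real AWS API:

```python
def auto_label(raw_items, threshold, batch_size=100):
    """Illustrative sketch of the Ground Truth automated-labeling loop."""
    # Seed: a random sample is labeled by human workers (e.g. Mechanical Turk).
    labeled = [(item, human_label(item)) for item in raw_items[:batch_size]]
    remaining = raw_items[batch_size:]
    while remaining:
        model = train_labeling_model(labeled)          # retrain on all labels so far
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for item in batch:
            label, confidence = model.predict(item)
            if confidence >= threshold:
                labeled.append((item, label))          # committed automatically
            else:
                labeled.append((item, human_label(item)))  # routed to a human
    return labeled
```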
SageMaker Automatic Model Tuning
Hyperparameters are parameters exposed by machine learning algorithms that control how the underlying algorithm operates, and their values affect the quality of the trained models.
Automatic model tuning is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
Best Practices for Hyperparameter tuning
Choosing the Number of Hyperparameters – limit the search to a smaller number, as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search.
Choosing Hyperparameter Ranges – DO NOT specify a very large range to cover every possible value for a hyperparameter. The range of values you choose to search can significantly affect the success of hyperparameter optimization.
Using Logarithmic Scales for Hyperparameters – use a logarithmic scale for hyperparameters whose search range spans several orders of magnitude; this can improve hyperparameter optimization.
Choosing the Best Number of Concurrent Training Jobs – running one training job at a time achieves the best results with the least amount of compute time, since tuning can learn from each completed job.
Running Training Jobs on Multiple Instances – design distributed training jobs so that they report the objective metric that you want (see the sketch after this list).
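A sketch of automatic model tuning with the SageMaker Python SDK, assuming an XGBoost-style estimator and metric from earlier sketches; the ranges and names are illustrative:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        # log scale for a range spanning several orders of magnitude
        "eta": ContinuousParameter(1e-4, 1.0, scaling_type="Logarithmic"),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=1,        # fewer concurrent jobs -> better search results
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/val/"})
```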
SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
Automatically optimizes models built with popular deep learning frameworks so they can be deployed on multiple hardware platforms.
Optimized models run up to two times faster and consume less than a tenth of the resources of typical machine learning models.
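A hedged sketch of a Neo compilation job via boto3; the role ARN, S3 paths, input shape, and target device are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_compilation_job(
    CompilationJobName="my-neo-job",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/model/model.tar.gz",
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',   # input tensor shape
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/neo-output/",
        "TargetDevice": "jetson_nano",                     # edge target example
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```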
Users pay for the ML compute, storage, and data processing resources they use for hosting notebooks, training models, performing predictions, and logging outputs.