Training Machine Learning models

Before a Machine Learning Model can be deployed to the production environment it has to be trained. Training Machine Learning Models allows the algorithm to learn from the training data how to make a generalized prediction. This is an iterative process where the training data is processed multiple times as the algorithm learns from previous iterations to improve its fit to the data.

This study guide is split into three sections:

  1. Model training concepts
  2. Model training data and techniques
  3. Running training jobs

The Model training concepts section discusses how Machine Learning algorithms work. The algorithm searches for parameter values that minimise the difference between the value it calculates and the true value provided by the training data. Each iteration through the training data brings the parameters closer to their optimum values.

The Model training data and techniques section discusses how data is managed to train the Machine Learning Model. The data is formatted for ease of processing and compressed to speed up learning and reduce training time. The data is split into data used to train the Model and data used to validate and test it. Validation occurs during training and testing is performed once training is complete. Where there is not much training data, a technique called k-fold cross-validation can be employed.

The Running training jobs section discusses how training jobs are run and the infrastructure options available.

This study guide covers subdomain 3.3 Train machine learning models of the exam syllabus.

Questions

To confirm your understanding scroll to the bottom of the page for questions and answers.

Model training concepts

During training the Machine Learning algorithm updates parameters, or weights, to make the predicted output as similar as possible to the true values contained in the training data. The algorithm processes the data either a fixed number of times or until the loss, also called error, is as small as possible. Each iteration in which the whole training dataset is processed is called an epoch.

The algorithm learns by remembering the outcomes of the previous iterations and using that information to improve the prediction in the next iteration. To perform this learning the Machine Learning algorithm needs:

  • A Loss Function (also called an Objective Function)
  • An optimization technique

Loss Function

The loss function is a measure of error, where error is the difference between the output the Model infers and the true value. The loss is used to update the Model after each iteration. If the Model's inference is perfect the loss is zero, otherwise it is greater than zero. A Model comprises a set of weights and biases trained to have low loss on average.

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is a simple way to calculate the error, or loss. For each record the difference between the actual and inferred values is calculated and squared. The squared differences are averaged across the dataset and the square root of that average is taken. This approximates the standard deviation of the differences between the actual results and those inferred by the algorithm.
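
The RMSE calculation can be sketched in a few lines of plain Python. This is a minimal illustration only: the function name `rmse` and the sample values are assumptions made for this example and do not come from any particular library.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    # Square each difference, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values: the true values and a Model's inferences.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
print(rmse(actual, predicted))
```

Because the differences are squared before averaging, large errors contribute more to the result than small ones.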

Log likelihood loss

The log likelihood loss is also known as cross-entropy loss. It works with the logarithm of probabilities and is often used with classification problems. Log likelihood loss “is a measure of the difference between two probability distributions for a given random variable or set of events” (Jason Brownlee in Harshith).
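
As a hedged sketch, the binary form of this loss can be written in plain Python as below. The function name `binary_cross_entropy`, the `eps` clipping value and the sample probabilities are assumptions chosen for illustration.

```python
import math

def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean negative log likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) is never evaluated
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]  # probabilities close to the true labels
unsure = [0.6, 0.4, 0.5, 0.55]     # probabilities close to 0.5
print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, unsure))
```

The confident, correct probabilities give the lower loss, which is the behaviour the optimizer rewards during training.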

Minima

Minima are the points on a graph where error is at a minimum. A local minimum is where the loss is at its lowest within a region. The global minimum is the lowest the loss can be in the whole domain.

Optimization techniques

The goal of model optimization is to identify the parameter values that make the loss as low as possible. This point is called a minimum. Optimization techniques are used by the Machine Learning algorithm to find the minimum.

Exhaustive search

To optimize parameters an exhaustive search could be used, training a Model for each change in a variable. However, this would result in a large number of training jobs, most of which show no improvement. Whilst this method is simple, it soon becomes too slow and expensive as the number of variables increases.
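
The cost of an exhaustive search can be illustrated with a small sketch. It assumes a toy Model with two parameters (`weight` and `bias`) and a hypothetical grid of 21 candidate values per parameter; all names and values are made up for this example.

```python
import itertools

def mean_squared_error(weight, bias, xs, ys):
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def exhaustive_search(xs, ys, candidates):
    """Try every (weight, bias) pair and keep the one with the lowest loss."""
    best = None
    evaluations = 0
    for weight, bias in itertools.product(candidates, repeat=2):
        evaluations += 1
        loss = mean_squared_error(weight, bias, xs, ys)
        if best is None or loss < best[0]:
            best = (loss, weight, bias)
    return best, evaluations

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                       # generated from y = 2x + 1
candidates = [i * 0.5 for i in range(-10, 11)]  # 21 values from -5.0 to 5.0
best, evaluations = exhaustive_search(xs, ys, candidates)
print(best, evaluations)
```

With 21 candidates and two parameters there are already 21 × 21 = 441 evaluations. Each extra parameter multiplies that number by 21, which is why the approach quickly becomes impractical.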

Gradient descent optimization

Gradient Descent optimization is a shortcut to finding the minimum, the parameter values that produce the smallest loss. Gradient Descent works by moving the parameter values in steps towards the minimum, following the direction in which the loss falls most steeply. The loss decreases as the parameters approach their optimal values. The size of the steps is called the learning rate.

Learning Rate

The size of the steps taken in gradient descent is called the Learning Rate. If the Learning Rate is too small it can take many iterations to find the minimum. If it is too big you could overshoot the minimum and never find it. The Learning Rate is a hyperparameter to be optimised.
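
A minimal sketch of gradient descent, assuming a toy Model `y = weight * x + bias` and mean squared error as the loss. The function name, data and learning rate values are illustrative assumptions, not taken from SageMaker.

```python
def gradient_descent(xs, ys, learning_rate, epochs):
    """Fit y = weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # One epoch: compute the error on every record, then update once.
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        grad_weight = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_bias = 2.0 / n * sum(errors)
        # Step against the gradient; the learning rate sets the step size.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
weight, bias = gradient_descent(xs, ys, learning_rate=0.05, epochs=2000)
print(round(weight, 3), round(bias, 3))
```

On this data a learning rate of 0.05 settles on the true values, whereas a much larger value such as 0.3 makes each step overshoot further than the last so the parameters diverge.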

Video: Gradient Descent, Step-by-Step

A 23.53 minute video from StatQuest.

Gradient descent

The advantage of Gradient Descent is that it finds the minimum more quickly than an exhaustive search. However there are also disadvantages:

  1. All the data has to be processed before parameters can be updated.
  2. The whole dataset must fit into memory.
  3. Processing can get stuck at local minima.

Gradient descent variants

There are three common variants for Gradient Descent:

  1. Gradient descent
  2. Stochastic Gradient Descent (SGD)
  3. Mini batch Gradient Descent

| | Gradient Descent | Stochastic Gradient Descent | Mini batch Gradient Descent |
|---|---|---|---|
| Parameters updated | Every epoch | Every record | Every batch |
| Speed | Slowest | Fast | Slower |
| Gradient steps | Smooth | Noisy, oscillating | Not so smooth, but less noisy and oscillating |

Comparison table of Gradient Descent variants

Stochastic Gradient Descent

Whereas Gradient Descent updates parameters once per epoch, Stochastic Gradient Descent updates parameters after each record. This usually approaches the minimum more quickly. However SGD may oscillate in different directions and can be regarded as noisy.

Mini-batch

Mini Batch Gradient Descent is a compromise between the two other methods. Training data is split up into small batches, with parameters being updated after each batch is processed. This finds the minimum more quickly than Gradient Descent, although it is slower than Stochastic Gradient Descent. Its journey towards the minimum is less noisy and has less oscillation than SGD. The big advantage of Mini Batch Gradient Descent is that only one small batch has to be held in memory at a time, rather than the whole dataset.
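
The three variants differ only in how many records are processed before each parameter update. The sketch below assumes a toy linear Model and a hypothetical `train` function with a `batch_size` argument; the data, learning rate and epoch count are illustrative.

```python
import random

def train(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Gradient descent for y = weight * x + bias with a configurable batch size.

    batch_size == len(xs) -> gradient descent (one update per epoch)
    batch_size == 1       -> stochastic gradient descent (one update per record)
    anything in between   -> mini batch gradient descent
    """
    rng = random.Random(seed)
    weight, bias, updates = 0.0, 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the records in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [weight * xs[i] + bias - ys[i] for i in batch]
            grad_weight = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_bias = 2.0 / len(batch) * sum(errors)
            weight -= learning_rate * grad_weight
            bias -= learning_rate * grad_bias
            updates += 1
    return weight, bias, updates

xs = [0.5 * i for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]  # generated from y = 2x + 1
results = {}
for name, size in [("batch", len(xs)), ("mini-batch", 4), ("stochastic", 1)]:
    results[name] = train(xs, ys, batch_size=size)
    weight, bias, updates = results[name]
    print(name, "updates:", updates, "weight:", round(weight, 3), "bias:", round(bias, 3))
```

For the same number of epochs, the smaller the batch, the more often the parameters are updated. Only the records in the current batch are needed to compute each update.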

Model training data and techniques

Formatting data

To train a Model you need data. There are two common formats for training data: CSV and RecordIO protobuf. Other less common formats are JSON and LibSVM. CSV is used because it is simple and ubiquitous. RecordIO protobuf is used because it is a binary data format, allowing data to be highly compressed to reduce storage and speed up data transfer. Using this format enables you to stream data into the algorithm directly from S3 using pipe mode. This leads to faster start times and better throughput.

Video: Serialization formats: JSON and Protobuf

This is a 4.59 minute video.

  • 0 – JSON
  • 2.49 – Protobuf

Splitting data and cross validation

There are three datasets used for training and evaluation of models:

  1. Training data
  2. Validation data
  3. Test data

| Dataset type | Description | Phase |
|---|---|---|
| Training | Used to train the model | Training |
| Validation | Evaluate the effectiveness of training, used in hyperparameter optimisation | Hyperparameter optimisation |
| Test | Evaluate the final model | Final Evaluation |

Comparison of the three datasets used to train a Model

Randomized data

Before splitting, data is usually randomized so that each part of the split is representative of the data as a whole. This prevents bias in data that may be sequenced by time or season. The absence of bias can be checked by comparing the error rates of all the folds, which should be similar.

Splitting

Once training is complete you will want to test and validate your Model to confirm it is performing well enough. To do this a portion of the data is reserved and not used in training the algorithm, so the performance of the Machine Learning Model can be measured on new data that it has not seen. This method is less suitable for small or sparse datasets because the portion of the data you are reserving may be essential to train the Model. Both the training data and the test data have to be representative of the data as a whole. A typical split retains 10% of the data for testing. Common splitting strategies are:

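| Dataset | Strategy 1 (%) | Strategy 2 (%) |
|---|---|---|
| Training dataset | 80 | 70 |
| Testing dataset | 10 | 15 |
| Validation dataset | 10 | 15 |
| Total | 100 | 100 |

Common splitting strategies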
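
Randomizing and splitting can be sketched with the Python standard library. The function name `shuffle_and_split`, the default fractions (the 80/10/10 strategy) and the seed are assumptions for illustration.

```python
import random

def shuffle_and_split(records, train_fraction=0.8, validation_fraction=0.1, seed=42):
    """Shuffle records, then split them into training, validation and test lists."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # randomize before splitting
    train_end = round(len(shuffled) * train_fraction)
    validation_end = train_end + round(len(shuffled) * validation_fraction)
    return (shuffled[:train_end],
            shuffled[train_end:validation_end],
            shuffled[validation_end:])

records = list(range(100))  # stand-in for 100 rows of data
train, validation, test = shuffle_and_split(records)
print(len(train), len(validation), len(test))
```

Every record lands in exactly one of the three lists, so the Model is never tested on data it was trained on.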

Overfitting and underfitting

In Machine Learning we are trying to create models that generalize. So when the model processes data it has not been trained on, it infers a generalized solution. Overfitting can prevent this. With overfitting the model has been trained on data to a point where it can no longer generalize. It produces the correct inference for the training data, but cannot cope with the data it has not seen before. 

In an overfitted Model the Model uses noise or unimportant features to make the inference. When challenged with real data it was not trained on, the absence of those features may prevent it from making a correct inference. It could also be that the Model identifies data noise as being important and makes an inference when none should have been made.

Underfitting is where a Model cannot capture the underlying structure of the data. This leads to poor inferences because some of the features of the data are not recognized. Also, because some of the features of the data are not recognized by the Model as being important, unimportant features may be used to make the inference.

Video: But What Is Overfitting in Machine Learning?

This is a 3.27 minute video by Oscar Alsing.

Bias

Bias in Machine Learning can be introduced by training data not being fully representative of the real world. If a Model is trained on unrepresentative data it may make biased inferences. In many cases businesses want to know why an inference was made, and there may also be regulatory issues if a section of the public is unable to access a service. Training data that is incomplete, faulty or prejudicial can introduce bias. The data has to be checked to confirm it is representative.

Testing and validation techniques

During model training it is important to test and validate the Model to ensure it is performing as expected. This process uses three types of datasets:

  1. Training dataset: data the model is trained on
  2. Validation dataset: used to check the model is on track
  3. Test dataset: used for the final assessment of the Model

There is more than one way to test and validate a Model. The simplest testing and validation technique is called Simple Hold Out Validation.

Simple Hold Out Validation

In Simple Hold Out Validation part of the data is set aside for testing at the end, once the Model is trained. The two advantages of this technique are that it is simple and quick to implement and that it is very easy to guarantee the Model has never seen the test data before. However there are two drawbacks. First, you need enough data to be able to split off part and still have enough to train on. Secondly, the training data and the test data must each be representative of real world data. If the data is split into training, validation and testing sets, the Model is checked against the validation data during training so that it can be tweaked, and is tested against the test data at the end of training. A common split is 80%, 10%, 10%.

Cross validation

When you don’t have much data you want to use as much of it as possible to train the Model, but you still need to test the Model once it is trained. Cross validation is used to evaluate a Model with a small quantity of training data. In cross validation the Model is repeatedly trained with a portion of the data and tested with a different portion. This is called resampling. In k-fold cross-validation, a fold is one of the equal groups the data is split into and k is the number of groups. For each Model a different group of data is set aside for testing. Therefore all the data is used for training, but not all at the same time. The average of the Model scores is taken. This method leads to a less biased, less optimistic estimate of Model performance.

K-fold cross validation

k is the number of groups the data is divided into. The value of k is chosen to avoid high variance or bias. The most common values for k are 5 and 10. For a value of k = 10 the Model is trained on 90% of the data and tested on 10%. k has to create groups that are representative of the data as a whole.

k size

If a large value of k is used the Model will be trained more times, with the bulk of the training data being used each time. The time to train will increase because more training cycles will be performed. The error variance will increase because the test dataset will be small. However bias will reduce. Conversely, if a small value of k is chosen fewer training cycles will be performed, saving time. Error variance will decrease but bias will increase.
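
A minimal sketch of k-fold cross validation in plain Python. It assumes the data has already been randomized and uses a deliberately trivial stand-in "Model" (the mean of the training values) scored by mean absolute error. The function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(values, k):
    """Score a toy 'Model' (the training mean) on each fold and average the scores."""
    scores = []
    for train, test in k_fold_indices(len(values), k):
        prediction = sum(values[i] for i in train) / len(train)
        scores.append(sum(abs(values[i] - prediction) for i in test) / len(test))
    return sum(scores) / len(scores), scores

values = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 7.0]
mean_score, fold_scores = cross_validate(values, k=5)
print(mean_score, fold_scores)
```

With k = 5 and 10 records, each Model is trained on 8 records and tested on the remaining 2. Raising k increases the number of training cycles and shrinks each test fold.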

Video: How to Detect & Prevent Machine Learning Overfitting

This is a 10.06 minute video by Oscar Alsing.

  • 0 – Explaining Overfitting
  • 4.06 – K-fold cross validation
  • 5.17 – Occam’s razor
  • 6.06 – Overfitting
  • 6.25 – Early stopping
  • 7.06 – Regularization
  • 7.52 – Drop out
  • 8.33 – Ensemble learning

Leave one out cross validation

For leave one out cross validation the value of k is k = n, where n is the number of samples in the dataset. Each Model is trained on all the samples except one, which is used for testing. This technique gives a nearly unbiased estimate of Model performance and is simple to use. However it is computationally expensive because of the large number of Models trained. It is a useful technique when the dataset is very small or when an accurate estimate of Model performance is important.

Stratified K-fold cross validation

Stratified k-fold cross validation is used for imbalanced data, for example rainfall throughout the year, where some parts of the year regularly receive more rain. In this technique data is arranged so that each fold contains the same proportion of each target value. This ensures each fold is representative of the data as a whole.
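
One simple way to build stratified folds, sketched under the assumption that the target is a class label, is to group the records by label and deal each group out across the folds in turn. The function name and the "wet" / "dry" labels are made up for this example. In practice the records within each class would usually be shuffled first.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each record index to one of k folds, keeping class proportions equal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal out round-robin per class
    return folds

labels = ["wet"] * 4 + ["dry"] * 8  # imbalanced: one third wet, two thirds dry
folds = stratified_folds(labels, k=4)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each of the four folds receives one "wet" and two "dry" records, matching the proportions in the full dataset.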

Iterated k-fold cross validation

Iterated k-fold Cross Validation is also known as Repeated k-fold Cross Validation. Iterated k-fold Cross Validation is used when datasets produce noisy estimates of model performance. This means that each time a model is trained the different k-folds produce different performance scores. Repeating the training multiple times and averaging the output can produce better results. Common values for repeats are 3, 5 and 10. This is often used when training datasets are small and with algorithms that are not complex, such as Linear Learner. The data folds may also be shuffled each time before splitting.
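
Repeating k-fold cross validation can be sketched as below. The example reshuffles the data before each repeat and averages the per-repeat scores. The toy "Model" (the mean of the training values), the function name and the data are assumptions for illustration.

```python
import random
import statistics

def repeated_k_fold_score(values, k, repeats, seed=0):
    """Average a k-fold score over several repeats, reshuffling before each one."""
    rng = random.Random(seed)
    repeat_scores = []
    for _ in range(repeats):
        shuffled = list(values)
        rng.shuffle(shuffled)  # new fold membership for every repeat
        fold_scores = []
        for fold in range(k):
            test = shuffled[fold::k]  # every k-th record belongs to this fold
            train = [v for i, v in enumerate(shuffled) if i % k != fold]
            prediction = statistics.mean(train)  # toy Model: predict the mean
            fold_scores.append(statistics.mean(abs(v - prediction) for v in test))
        repeat_scores.append(statistics.mean(fold_scores))
    return statistics.mean(repeat_scores), repeat_scores

values = [float(v) for v in range(20)]
mean_score, per_repeat = repeated_k_fold_score(values, k=5, repeats=3)
print(mean_score, per_repeat)
```

The spread of the per-repeat scores gives a feel for how noisy a single k-fold estimate is on this dataset.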

Running training jobs

There are four infrastructure options for running training jobs in SageMaker:

  1. SageMaker built-in algorithms
  2. Custom script on supported framework
  3. BYO algorithm and framework
  4. AWS Marketplace algorithms and frameworks

Video: Train Your ML Models Accurately with Amazon SageMaker

An 18.29 minute video by Emily Webber from AWS.

  • 0 – Each model runs in a separate environment on its own dedicated EC2
  • 2.25 – How does training work
  • 4.13 – Splitting data for Machine Learning
  • 5.45 – You can send data to SageMaker in three channels
  • 6.54 – Confusion Matrix
  • 7.57 – Recall
  • 8.13 – Precision
  • 8.47 – Hyperparameter tuning
  • 9.39 – 1. Pick Hyperparameters
  • 10.04 – 2. Pick objective metric
  • 10.38 – 3. Pick objective metric
  • 11.10 – AUC, Area Under the Curve
  • 11.56 – Demo with XGBoost
  • 17.12 – Pro Tips
  • 18.29 – End

Training infrastructure options

SageMaker built-in algorithms

There are 17 SageMaker built-in algorithms to choose from. SageMaker provides everything you need; all you have to do is tune the hyperparameters. SageMaker algorithms are also available pre-packaged in their own Docker images. For more information about built-in algorithms see the links below:

Custom script on supported framework

SageMaker has Docker containers pre-loaded with deep learning frameworks such as Apache MXNet, TensorFlow and PyTorch, and with libraries such as Scikit-learn and SparkML. All you need to provide is your own training script. This method allows you to use your own scripts whilst running inside the SageMaker environment and leveraging its orchestration services.

BYO algorithm and framework

For the Bring Your Own algorithm and framework you take a SageMaker image and load up your own framework and algorithm. You can also bring your own container. With this option you are doing everything yourself and running the algorithm in the SageMaker environment. This gives you access to the orchestration and processing capabilities of SageMaker.

AWS Marketplace algorithms and frameworks

The AWS Marketplace hosts trained Models and untrained algorithms provided by third party software vendors. These are packaged up into Docker containers for easy deployment into the SageMaker environment. The Models are accessed through RESTful endpoints.

Parameters and Hyperparameters

Parameters are updated during training by the algorithm. Hyperparameters are set before training and tuned by the user or by automatic hyperparameter optimization.

| | Parameters | Hyperparameters |
|---|---|---|
| When updated | During training | Before training |
| How updated | The ML algorithm using the loss function and optimization technique | Human users or automatic hyperparameter optimization |
| Example variables | Weights, bias | Learning rate |

Comparison of Parameters and Hyperparameters

These revision notes concern parameters only. Hyperparameter optimization is covered in Model Tuning.

How to create a training job in SageMaker

In SageMaker training is managed by a training job. A training job has this information:

  1. URL of the S3 bucket containing training data
  2. The SageMaker EC2 instances you selected to host the Docker containers with the Model
  3. URL of the S3 bucket for the output
  4. The ECR path to the training code
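
As a hedged sketch, the information above maps onto the request structure of the SageMaker `CreateTrainingJob` API roughly as shown below. The helper name `build_training_job_request`, the bucket, image, role and instance values are all hypothetical placeholders. The API also requires an IAM role (`RoleArn`), which is not in the list above. The sketch only builds the request dictionary and makes no call to AWS.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                               output_s3_uri, instance_type="ml.m5.xlarge",
                               instance_count=1):
    """Assemble a CreateTrainingJob style request from the four pieces of information."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # 4. ECR path to the training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,                  # IAM role SageMaker assumes
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,        # 1. S3 location of the training data
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},  # 3. S3 output location
        "ResourceConfig": {                   # 2. compute instances for training
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# Every value below is a placeholder, not a real resource.
request = build_training_job_request(
    job_name="example-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
print(sorted(request))
```

With AWS credentials configured, a dictionary like this could be passed to the boto3 SageMaker client as `create_training_job(**request)`.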

Running a training job using containers

SageMaker is based on Docker containers. However, how involved you become with Docker development depends on how much you want to leverage SageMaker features.

| Docker engagement | Infrastructure options |
|---|---|
| Highest | Adapt an existing non-SageMaker container image |
| Higher | Extend a pre-built SageMaker image |
| Lower | Pre-built SageMaker container image |
| Lowest | Built-in SageMaker algorithm or framework |

To adapt an existing container image you are taking a Docker container that was developed independently of SageMaker and adapting it to work in the SageMaker environment. The choice of operating system, frameworks, tools and your algorithm will have been installed and tested. Now the SageMaker Training toolkit has to be enabled.

Extending a pre-built SageMaker Docker container image involves modifying a pre-built SageMaker algorithm already packaged in a Docker container. A framework and algorithm will already be present for you to work with.

Prebuilt SageMaker Docker container images can also be used as they are. These are called Deep Learning Container Images. You can choose:

  • SageMaker built-in images
  • Frameworks: e.g. Apache MXNet, TensorFlow and PyTorch
  • Libraries: Scikit-learn, SparkML

Using the SageMaker built-in algorithms or frameworks requires the least user interaction with Docker. The SageMaker console, or SageMaker Studio, spins up the container and infrastructure in the background without user direction or interaction.

Video: Machine Learning with Containers and Amazon SageMaker – AWS Online Tech Talks

A 49.45 minute video from AWS.

Bring your own containers

It is possible to build your own container and then enable it to operate in the SageMaker infrastructure. However, with SageMaker supporting so many Machine Learning frameworks it may be better to take a pre-built SageMaker Deep Learning container with the desired framework already set up. This is what you have to define:

  1. Define training / server entry points in the Dockerfile
  2. Set up the expected file structure that defines the location of:
    1. Hyperparameter files
    2. Inputs / output channels
    3. Supporting channels
  3. The model has to be hosted with an HTTP server to frontend HTTP requests from SageMaker
  4. A Dockerfile to describe how the container is built
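
As a rough sketch of the file structure in step 2, SageMaker training containers conventionally find hyperparameters in `/opt/ml/input/config/hyperparameters.json`, input channels under `/opt/ml/input/data/<channel>/`, and write the trained Model to `/opt/ml/model`. The function name `load_training_inputs` and the sample hyperparameters are assumptions. The example builds a temporary copy of the layout so it can run outside a container.

```python
import json
import os
import tempfile

def load_training_inputs(base_dir="/opt/ml"):
    """Locate hyperparameters, input channels and the model directory."""
    config_path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    with open(config_path) as handle:
        hyperparameters = json.load(handle)
    data_dir = os.path.join(base_dir, "input", "data")
    channels = {name: os.path.join(data_dir, name) for name in sorted(os.listdir(data_dir))}
    model_dir = os.path.join(base_dir, "model")
    return hyperparameters, channels, model_dir

# Build a throwaway copy of the layout so the sketch runs outside a container.
with tempfile.TemporaryDirectory() as base_dir:
    os.makedirs(os.path.join(base_dir, "input", "config"))
    os.makedirs(os.path.join(base_dir, "input", "data", "train"))
    os.makedirs(os.path.join(base_dir, "input", "data", "validation"))
    os.makedirs(os.path.join(base_dir, "model"))
    with open(os.path.join(base_dir, "input", "config", "hyperparameters.json"), "w") as handle:
        json.dump({"learning_rate": "0.1", "epochs": "10"}, handle)
    hyperparameters, channels, model_dir = load_training_inputs(base_dir)
    learning_rate = float(hyperparameters["learning_rate"])  # values arrive as strings
    channel_names = list(channels)
    print(learning_rate, channel_names)
```

Inside a real container the function would be called with its default `base_dir`, and the training code would save its output under the returned model directory.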

Video: SageMaker – Bring your Own Container

A 15.16 minute video from AWS.

When to use bring your own containers

Bring your own containers can be used to do things that SageMaker does not provide:

  • Access to the latest versions of frameworks not yet available in SageMaker
  • Custom code or algorithms
  • Access to data lakes via APIs
  • Access to backend runtime non-Python kernels, for example R, Julia

P3 instances

P3 instances have large numbers of GPUs for fast processing of large workloads. The GPUs are supported by large bandwidth networking infrastructure for high throughput. The effect of this engineering is to reduce training times from days to minutes and can lead to cost savings.

Video: Introduction to Amazon EC2 P3 Instances

A 2.18 minute video from AWS.

Components of an ML training job for Deep Learning

These are the components of a Machine Learning training job for Deep Learning:

  1. Define an IAM Role
  2. SageMaker Notebook, this is a Jupyter notebook
  3. Framework: e.g. TensorFlow
  4. Interface: e.g. Keras. Keras is an open source library that provides a Python interface for artificial neural networks
  5. Keras optimizers:
    1. Adam – Adaptive moment estimation
    2. SGD – Stochastic gradient descent
  6. Neural Network parameters, the size and shape of the neural network
  7. Hyperparameters
  8. Dataset locations in S3:
    1. training dataset
    2. validation dataset
    3. Evaluation dataset
  9. Output location for validation report
  10. Location to save the trained Model

Video: Machine Learning Models with TensorFlow Using Amazon SageMaker – AWS Online Tech Talks

A 40.15 minute video from AWS.

Summary

This study guide has covered the training phase of Machine Learning. Data is formatted to help the algorithm process it quickly. The data is split into training and testing portions so that a Model can be tested on data it has not been trained on. The internal processing of the algorithm was explained in Model training concepts. In the last section the infrastructure options for running training jobs were described.

Credits


AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions for immediate access. (Visit Amazon books)


10 questions and answers

Created by Michael Stainsbury

3.3 Training Machine Learning models (Silver)

10 quiz type questions that cover subdomain 3.3 Train machine learning models of the Modeling knowledge domain.

1 / 10

Why is data split and retained for testing and validation?

2 / 10

What is the point on a graph where error is at its lowest level called?

3 / 10

What are the infrastructure options for running training jobs in SageMaker?

4 / 10

What are the common model optimization techniques?

What are the advantages of using recordIO protobuf?

  1. It can be compressed. This reduces storage and speeds up data transfer.
  2. It enables you to stream data into the algorithm using pipe mode directly from S3.

5 / 10

What training data format can be compressed and streamed into the algorithm using pipe mode directly from S3?

6 / 10

In Model training what is another name for parameters, the internal values being used to process the data?

7 / 10

What are the deployment methods for SageMaker built in algorithms?

SageMaker managed algorithms.
Docker images with SageMaker algorithms pre-loaded.

8 / 10

The types of datasets used to test and train Models are:

  1. Training dataset
  2. <–?–> dataset
  3. Test dataset

9 / 10

10 / 10

The goal of model optimization is to identify what values for parameters are needed to make <–?–> as low as possible. This point is called a minimum.
