Training Machine Learning models
Before a Machine Learning Model can be deployed to the production environment it has to be trained. Training Machine Learning Models allows the algorithm to learn from the training data how to make a generalized prediction. This is an iterative process where the training data is processed multiple times as the algorithm learns from previous iterations to improve its fit to the data.
This study guide is split into three sections:
Model training concepts discusses how Machine Learning algorithms work. The algorithm searches for parameter values that minimise the difference between a calculated value and the true value provided by the training data. Each iteration through the training data brings the parameters closer to their optimum values.
The Model training data and techniques section discusses how data is managed to train the Machine Learning Model. The data is formatted for ease of processing and compressed to speed up the learning process and reduce the training time. The data is split into data that is used to train the Model and data that is used to validate and test the Model. Validation occurs during training and testing is performed once training is complete. In situations where there is not much training data a technique called k-fold cross-validation can be employed.
Running training jobs discusses how training jobs are run and the infrastructure options available.
This study guide covers subdomain 3.3, Train machine learning models, of the exam syllabus.
Model training concepts
During training the Machine Learning algorithm updates parameters, or weights, to make the predicted output as similar as possible to the true values contained in the training data. The algorithm processes the data either a fixed number of times or until the loss, also called error, stops improving. Each iteration in which the whole training dataset is processed is called an epoch.
The algorithm learns by remembering the outcomes of the previous iterations and using that information to improve the prediction in the next iteration. To perform this learning the Machine Learning algorithm needs:
- A Loss Function (also called an Objective Function)
- An optimization technique
The loss function is a measure of error. Error is the difference between the output the Model infers and the true value. The loss function is used to update the Model after each iteration. If the Model's inference is perfect the loss is zero, otherwise the loss is greater than zero. A Model comprises a set of weights and biases trained to have low loss on average.
- Google: Training and Loss
Root Mean Square Error (RMSE)
Root Mean Square Error (RMSE) is a simple way to calculate the error, or loss. For each record the difference between the actual value and the inferred value is calculated and squared. The squared differences are averaged across the dataset and the square root of that average is taken. The result can be interpreted as the standard deviation of the differences between the actual results and the ones inferred by the algorithm. RMSE is the loss usually minimized for regression problems.
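As a rough illustration of the calculation, the sketch below squares each difference, averages the squares and takes the square root. The function name `rmse` and the sample values are made up for this example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    # Square each difference so positive and negative errors do not cancel out.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    # The square root brings the result back to the units of the target.
    return math.sqrt(mean_squared_error)

actual = [3.0, 5.0, 7.0]       # made-up true values
predicted = [2.0, 5.0, 9.0]    # made-up Model outputs
print(rmse(actual, predicted))
```

A perfect set of predictions gives an RMSE of zero, and larger errors are penalised more heavily because they are squared before averaging.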
Log likelihood loss
The log likelihood loss is also known as cross-entropy loss. It works with the logarithm of the predicted probabilities and is often used with classification problems. Log likelihood loss “is a measure of the difference between two probability distributions for a given random variable or set of events” (Jason Brownlee in Harshith).
- Harshith: Log loss function math explained
- A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation
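For a two-class problem the log likelihood loss can be sketched as below. This is a minimal illustration, not the implementation used by any particular library. The function name `log_loss`, the clipping value `eps` and the sample probabilities are assumptions made for the example.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy.

    y_true holds 0/1 labels and p_pred holds the predicted probability of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip so log(0) is never evaluated
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]                        # made-up true classes
print(log_loss(labels, [0.9, 0.1, 0.8]))  # confident and correct: low loss
print(log_loss(labels, [0.6, 0.4, 0.5]))  # unsure: higher loss
```

The loss falls towards zero as the predicted probabilities move towards the true labels, and grows quickly when the Model is confident but wrong.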
A minimum is a point on a graph where the error is at its lowest. A local minimum is where the loss is at its lowest within a region. The global minimum is the lowest the loss can be across the whole domain.
The goal of model optimization is to identify the parameter values that make the loss as low as possible. This point is called a minimum (plural minima). Optimization techniques are used by the Machine Learning algorithm to find the minimum.
To optimize parameters an exhaustive search could be used, training a model for each change in a variable. However, this would result in a large number of training jobs, most of which would show no improvement. Whilst this method is simple it soon becomes too slow and expensive as the number of variables increases.
Gradient descent optimization
Gradient Descent optimization is a shortcut to finding the minimum, the parameter values that produce the smallest loss. Gradient Descent works by moving the parameter values in steps towards the minimum, in the direction in which the loss falls most steeply. The loss decreases as you approach the optimal values and therefore the minimum.
The size of the steps taken in gradient descent is called the Learning Rate. If the Learning Rate is too small it can take many iterations to find the minimum. If it is too big you could overshoot the minimum and never find it. The Learning Rate is a hyperparameter to be optimised.
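The effect of the Learning Rate can be seen in a small sketch. The loss function `(w - 3)**2`, the starting value and the three Learning Rates are assumptions chosen for illustration. A real Model has many parameters and a loss calculated from the training data.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimise loss(w) = (w - 3)**2, whose minimum is at w = 3."""
    for _ in range(steps):
        gradient = 2 * (w - 3)             # slope of the loss at the current w
        w = w - learning_rate * gradient   # step in the downhill direction
    return w

for rate in (0.1, 0.001, 1.1):
    print(rate, gradient_descent(rate))
```

With these values a Learning Rate of 0.1 ends very close to 3 after 50 steps, 0.001 has barely moved from the starting point, and 1.1 overshoots further on every step and moves away from the minimum.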
Video: Gradient Descent, Step-by-Step
A 23.53 minute video from StatQuest.
The advantage of Gradient Descent is that it finds the minimum more quickly than an exhaustive search. However there are also disadvantages:
- All the data has to be processed before parameters can be updated.
- The whole dataset must fit into memory.
- Processing can get stuck at local minima.
Gradient descent variants
There are three common variants for Gradient Descent:
- Gradient descent
- Stochastic Gradient Descent (SGD)
- Mini batch Gradient Descent
| ||Gradient Descent||Stochastic Gradient Descent||Mini batch Gradient Descent|
|Parameters updated||Every epoch||Every record||Every batch|
|Gradient steps||Smooth||Noisy, oscillating||Not so smooth, but less noisy and oscillating|
Stochastic Gradient Descent
Whereas Gradient Descent updates parameters once per epoch, Stochastic Gradient Descent updates parameters after each record. This results in identifying the minimum more quickly. However SGD may oscillate in different directions and can be regarded as noisy.
Mini Batch Gradient Descent
Mini Batch Gradient Descent is a compromise between the two other methods. Training data is split up into small batches with parameters being updated after each batch is processed. This finds the minimum quicker than Gradient Descent, although it is slower than Stochastic Gradient Descent. Its journey towards the minimum is less noisy and has less oscillation than SGD. The big advantage of Mini Batch Gradient Descent is that only one small batch has to be held in memory at a time, rather than the whole dataset.
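The three variants differ only in how many records are processed before each parameter update, which the sketch below exposes as `batch_size`. The one-weight Model `y = w * x`, the data and the default values are assumptions made for this illustration.

```python
import random

def train(xs, ys, batch_size, learning_rate=0.01, epochs=50, seed=0):
    """Fit y = w * x by gradient descent on mean squared error.

    batch_size = len(xs) gives Gradient Descent, batch_size = 1 gives
    Stochastic Gradient Descent, anything in between is Mini Batch.
    """
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)   # visit the records in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            gradient = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * gradient
            updates += 1
    return w, updates

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]   # the true weight is 2

for size in (len(xs), 1, 2):
    print(size, train(xs, ys, batch_size=size))
```

All three runs should settle on a weight close to 2. The difference is the number of updates per epoch: one for Gradient Descent, eight for SGD and four for batches of two records.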
Model training data and techniques
To train a model you need data. There are two common formats for training data: CSV and RecordIO protobuf. Other less common formats are JSON and LibSVM. CSV is used because it is simple and ubiquitous. RecordIO protobuf is used because it is a binary data format allowing data to be highly compressed to reduce storage and speed up data transfer. Using this format enables you to stream data into the algorithm directly from S3 using Pipe mode. This leads to faster start times and higher throughput.
Video: Serialization formats: JSON and Protobuf
This is a 4.59 minute video.
- 0 – JSON
- 2.49 – Protobuf
- Apache MXNet: Efficient Data Loaders
- AWS docs: Common Data Formats for Training – Amazon SageMaker
Splitting data and cross validation
There are three datasets used for training and evaluation of models:
- Training data
- Validation data
- Test data
|Dataset||Purpose||Stage|
|Training||Used to train the model||Training|
|Validation||Evaluate the effectiveness of training, used in hyperparameter optimisation||Hyperparameter optimisation|
|Test||Evaluate the final model||Final Evaluation|
Before splitting, data is usually randomized so that each part of the split is representative of the data as a whole. This prevents bias when the data is sequenced, for example by time or season. The absence of bias can be checked by comparing the error rates of all the folds, which should be similar.
Once training is complete you will want to test and validate your model to confirm it is performing well enough. To do this a portion of the data is reserved and not used in training the algorithm, so the performance of the machine learning model can be measured on new data that the model has not seen. This method is not suitable for small or sparse datasets because the portion of the data you are reserving may be essential to train the model. The training data and the test data each have to be representative of the data as a whole. A typical split is to retain 10% of the data for testing. Common splitting strategies are covered in the links below:
- Train-Test Split for Evaluating Machine Learning Algorithms
- Google: Training and Test Sets: Splitting Data
- Google Video: Training and Test Sets
Overfitting and underfitting
In Machine Learning we are trying to create models that generalize. So when the model processes data it has not been trained on, it infers a generalized solution. Overfitting can prevent this. With overfitting the model has been trained on data to a point where it can no longer generalize. It produces the correct inference for the training data, but cannot cope with the data it has not seen before.
In an overfitted Model the Model uses noise or unimportant features to make the inference. When challenged with real data it was not trained on, the absence of those features may prevent it from making a correct inference. It could also be that the Model identifies data noise as being important and makes an inference when none should have been made.
Underfitting is where a Model cannot capture the underlying structure of the data. This leads to poor inferences because some of the features of the data are not recognized. Also, because some of the features of the data are not recognized by the Model as being important, unimportant features may be used to make the inference.
- Wikipedia: Overfitting
- Overfitting Definition
Video: But What Is Overfitting in Machine Learning?
This is a 3.27 minute video by Oscar Alsing.
Bias in Machine Learning can be introduced by training data not being fully representative of the real world. If a Model is trained on unrepresentative data it may make biased inferences. In many cases businesses want to know why an inference was made, and bias may also raise regulatory issues if a section of the public is unable to access a service. Training data that is incomplete, faulty or prejudicial can introduce bias. The data has to be checked to confirm it is representative.
- AWS (introduction only, AWS Clarify is too new to be in the exam): Amazon SageMaker Clarify Detects Bias and Increases the Transparency of Machine Learning Models
Testing and validation techniques
During model training it is important to test and validate the Model to ensure it is performing as expected. This process uses three types of datasets:
- Training dataset: data the model is trained on
- Validation dataset: used to check the model is on track
- Test dataset: used for the final assessment of the Model
There is more than one way to test and validate a Model. The simplest testing and validation technique is called Simple Hold Out Validation.
- AWS docs: Validate a Machine Learning Model – Amazon SageMaker
- What is the Difference Between Test and Validation Datasets?
Simple Hold Out Validation
In Simple Hold Out Validation part of the data is set aside for testing at the end, once the Model is trained. The two advantages of this technique are that it is simple and quick to implement and that it is very easy to guarantee that the Model has never seen the test data before. However there are two drawbacks. First, you need enough data to be able to split off part and still have enough to train on. Secondly, the training data and the test data must each be representative of real world data. If the data is split into training, validation and test sets then the Model is checked against the validation data during training, so it can be tweaked, and tested against the test data at the end of training. A common split is 80%, 10%, 10%.
- Holdout vs cross validation in Machine Learning
- Making Predictive Models Robust: Holdout vs Cross-Validation
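The 80%, 10%, 10% hold out split described above can be sketched as follows. The function name `split_dataset`, the fixed random seed and the list of 100 made-up records are assumptions for this example. The records are shuffled first so that each portion is representative of the whole.

```python
import random

def split_dataset(records, train_frac=0.8, validation_frac=0.1, seed=42):
    """Shuffle the records and split them into train, validation and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # remove any ordering, e.g. by date
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # whatever is left over
    return train, validation, test

train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))   # 80 10 10
```

No record appears in more than one portion, so the test set is guaranteed to be data the Model has not seen during training.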
When you don’t have much data you want to use as much as possible to train the Model, however you still need to test the Model once it is trained. Cross validation is used to evaluate a Model with a small quantity of training data. In cross validation the Model is repeatedly trained with a portion of the data and tested with a different portion. This is called resampling. In k-fold cross-validation, a fold is one of the equal groups the data is split into and k is the number of groups. For each Model a different group of data is set aside for testing. Therefore all data is used for training, but not all at the same time. The average of the Model scores is taken. This method leads to a less biased, less optimistic estimate of Model performance.
K-fold cross validation
k is the number of groups the data is divided into. The value of k is chosen to avoid high variance or bias. The most common values for k are 5 and 10. For a value of k = 10 the Model is trained on 90% of the data and tested on 10%. k has to create groups that are representative of the data as a whole.
If a large value of k is used the Model will be trained more times, with the bulk of the training data being used each time. The time to train will increase because more training cycles will be performed. The error variance will increase because each test fold will be small. However bias will reduce. Conversely if a small value of k is chosen fewer training cycles will be performed, saving time. Error variance will decrease but bias will increase.
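A minimal sketch of k-fold cross validation is shown below. The helper names and the toy Model, which simply predicts the mean of the training targets, are assumptions made to keep the example short. In practice the data would be shuffled first and a real algorithm would be trained on each set of training folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(targets, k):
    """Average the per-fold error of a toy Model that predicts the training mean."""
    fold_errors = []
    for train, test in k_fold_indices(len(targets), k):
        prediction = sum(targets[i] for i in train) / len(train)
        error = sum(abs(targets[i] - prediction) for i in test) / len(test)
        fold_errors.append(error)
    return sum(fold_errors) / k

targets = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 4.0, 6.0]   # made-up values
print(cross_validate(targets, k=5))
```

With 10 records and k = 5 each Model is trained on 8 records and tested on the remaining 2, and every record is used for testing exactly once. Setting k equal to the number of records gives one test record per fold.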
Video: How to Detect & Prevent Machine Learning Overfitting
This is a 10.06 minute video by Oscar Alsing.
- 0 – Explaining Overfitting
- 4.06 – K-fold cross validation
- 5.17 – Occam’s razor
- 6.06 – Overfitting
- 6.25 – Early stopping
- 7.06 – Regularization
- 7.52 – Drop out
- 8.33 – Ensemble learning
Leave one out cross validation
For leave one out cross validation the value of k is k=n, where n is the number of samples in the dataset, so each Model is tested on a single sample. This technique gives a nearly unbiased estimate of Model performance and is simple to use. However it is computationally expensive because of the large number of Models trained. It is a useful technique when the dataset is very small or when an accurate estimate of Model performance is important.
Stratified K-fold cross validation
Stratified k-fold cross validation is used for imbalanced data, for example rainfall through the year, where some parts of the year regularly receive more rain. In this technique data is allocated so that each fold contains the same proportion of each target class. This ensures each fold is representative of the data as a whole.
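One simple way to build stratified folds is to group the records by class and deal each class out to the folds in turn. The sketch below takes that approach. The function name `stratified_folds` and the "wet" and "dry" labels are assumptions made for this example, and libraries may use a different allocation scheme.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign record indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # next record of this class goes to the next fold
    return folds

labels = ["wet"] * 4 + ["dry"] * 16   # made-up imbalanced dataset, 20% wet
for fold in stratified_folds(labels, k=4):
    print([labels[i] for i in fold].count("wet"), "wet out of", len(fold))
```

Each of the four folds receives one "wet" record and four "dry" records, matching the 20% proportion in the whole dataset. A plain split of the same data in its original order would have put all the "wet" records in the first fold.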
Iterated k-fold cross validation
Iterated k-fold Cross Validation is also known as Repeated k-fold Cross Validation. Iterated k-fold Cross Validation is used when datasets produce noisy estimates of model performance. This means that each time a model is trained the different k-folds produce different performance scores. Repeating the training multiple times and averaging the output can produce better results. Common values for repeats are 3, 5 and 10. This is often used when training datasets are small and with algorithms that are not complex, such as Linear Learner. The data folds may also be shuffled each time before splitting.
Running training jobs
There are four infrastructure options for running training jobs in SageMaker:
- SageMaker built-in algorithms
- Custom script on supported framework
- BYO algorithm and framework
- AWS Marketplace algorithms and frameworks
Video: Train Your ML Models Accurately with Amazon SageMaker
An 18.29 minute video by Emily Webber from AWS.
- 0 – Each model runs in a separate environment on its own dedicated EC2 instance
- 2.25 – How does training work
- 4.13 – Splitting data for Machine Learning
- 5.45 – You can send data to SageMaker in three channels
- 6.54 – Confusion Matrix
- 7.57 – Recall
- 8.13 – Precision
- 8.47 – Hyperparameter tuning
- 9.39 – 1. Pick Hyperparameters
- 10.04 – 2. Pick objective metric
- 10.38 – 3. Pick objective metric
- 11.10 – AUC, Area Under the Curve
- 11.56 – Demo with XGBoost
- 17.12 – Pro Tips
- 18.29 – End
Training infrastructure options
SageMaker built-in algorithms
There are 17 SageMaker built-in algorithms to choose from. SageMaker provides everything you need, so all you have to do is tune the hyperparameters. SageMaker algorithms are also available pre-packaged into their own Docker images. For more information about built-in algorithms see the links below:
- How to select a model for a given machine learning problem
- SageMaker image processing algorithms
- SageMaker text processing algorithms
- SageMaker supervised algorithms
- SageMaker unsupervised algorithms
Custom script on supported framework
SageMaker has Docker containers pre-loaded with deep learning frameworks such as Apache MXNet, TensorFlow and PyTorch, and with libraries such as Scikit-learn and SparkML. All you need to provide is your own training script. This method allows you to use your own scripts whilst running inside the SageMaker environment and leveraging its orchestration services.
BYO algorithm and framework
For the Bring Your Own algorithm and framework you take a SageMaker image and load up your own framework and algorithm. You can also bring your own container. With this option you are doing everything yourself and running the algorithm in the SageMaker environment. This gives you access to the orchestration and processing capabilities of SageMaker.
AWS Marketplace algorithms and frameworks
The AWS Marketplace hosts trained Models and untrained algorithms provided by third party software vendors. These are packaged up into Docker containers for easy deployment into the SageMaker environment. The Models are accessed through RESTful endpoints.
Parameters and Hyperparameters
Parameters are updated and tuned during training by the algorithm. Hyperparameters are set before training and tuned by the user.
| ||Parameters||Hyperparameters|
|When updated||During training||Before training|
|How updated||The ML algorithm using the loss function and optimization technique||Human users or automatic hyperparameter optimization|
|Example variables||Weights, bias||Learning rate|
These revision notes concern parameters only. Hyperparameter optimization is covered in Model Tuning.
How to create a training job in SageMaker
In SageMaker training is managed by a training job. A training job has this information:
- URL of the S3 bucket containing training data
- The SageMaker EC2 instances you selected to host the Docker containers with the Model
- URL of the S3 bucket for the output
- The ECR path to the training code
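The four pieces of information listed above map onto the request passed to the SageMaker `CreateTrainingJob` API, which boto3 exposes as `create_training_job`. The sketch below only builds the request dictionary. The helper name, job name, bucket names, image path, role ARN, instance type and the `train` channel name are all placeholders invented for this example. The IAM role and stopping condition are not in the list above but the API requires them.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble the keyword arguments for boto3's create_training_job call."""
    return {
        "TrainingJobName": job_name,
        # The ECR path to the training code
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # URL of the S3 bucket containing the training data
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # URL of the S3 bucket for the output
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The instances that host the training containers
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = build_training_job_request(
    job_name="example-training-job",                                   # placeholder
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/example:latest",  # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",    # placeholder
    train_s3_uri="s3://example-bucket/train/",                         # placeholder
    output_s3_uri="s3://example-bucket/output/",                       # placeholder
)
print(sorted(request))
# With AWS credentials configured the job could then be started with:
# boto3.client("sagemaker").create_training_job(**request)
```

Building the request separately from the API call makes it easy to inspect before any infrastructure is started.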
Running a training job using containers
SageMaker is based on Docker containers. However, how involved you need to be with Docker development depends on how much you want to leverage SageMaker features.
|Docker engagement||Infrastructure options|
|Highest||Adapt an existing non-SageMaker container image|
|Higher||Extend a pre-built SageMaker image|
|Lower||Pre-built SageMaker container image|
|Lowest||Built-in SageMaker algorithm or framework|
Adapting an existing container image means taking a Docker container that was developed independently of SageMaker and adapting it to work in the SageMaker environment. The operating system, frameworks, tools and your algorithm will already have been chosen, installed and tested. All that remains is to enable the SageMaker Training toolkit.
Extending a pre-built SageMaker Docker container image involves modifying a pre-built SageMaker algorithm already packaged in a Docker container. A framework and algorithm will already be present for you to work with.
Pre-built SageMaker Docker container images can also be used as they are. These are called Deep Learning Container Images. You can choose:
- SageMaker built-in images
- Frameworks eg. Apache MXNet, TensorFlow and PyTorch
- Libraries: Scikit-learn, SparkML
Using the SageMaker built-in algorithms or frameworks requires the least user interaction with Docker. The SageMaker console, or SageMaker Studio, spins up the container and infrastructure in the background without user direction or interaction.
- AWS docs: Using Docker containers with SageMaker – Amazon SageMaker
- Deep Learning containers: deep-learning-containers
Video: Machine Learning with Containers and Amazon SageMaker – AWS Online Tech Talks
A 49.45 minute video from AWS.
Bring your own containers
It is possible to build your own container and then enable it to operate in the SageMaker infrastructure. However, with SageMaker supporting so many Machine Learning frameworks it may be better to take a pre-built SageMaker Deep Learning container with the desired framework already set up. This is what you have to define:
- Define training / server entry points in the Dockerfile
- Set up the expected file structure that defines the location of:
  - Hyperparameter files
  - Input / output channels
  - Supporting channels
- The model has to be hosted with an HTTP server to handle HTTP requests from SageMaker
- A Dockerfile to describe how the container is built
- Bring Your Own Container — Amazon SageMaker Examples 1.0.0 documentation
- Bring Your Own R Algorithm
- Building your own algorithm container — Amazon SageMaker Examples 1.0.0 documentation
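The file structure mentioned above can be illustrated with a sketch of a training entry point. SageMaker mounts its files under `/opt/ml` inside the container: hyperparameters arrive as a JSON file, each input channel is a directory of data, and anything written to the model directory is saved when the job ends. The function name `run_training`, the `train` channel name, the `model.json` file name and the placeholder "training" step are assumptions made for this example.

```python
import json
from pathlib import Path

def run_training(prefix="/opt/ml"):
    """Read SageMaker's inputs and write a model artifact under the given prefix."""
    base = Path(prefix)
    # Hyperparameters are passed to the container as a JSON file of strings.
    hyperparameters = json.loads(
        (base / "input/config/hyperparameters.json").read_text()
    )
    train_dir = base / "input/data/train"   # "train" is an assumed channel name
    model_dir = base / "model"              # contents are saved as the model artifact
    model_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder for real training: only counts the input files.
    n_files = sum(1 for p in train_dir.iterdir() if p.is_file())
    artifact = {"hyperparameters": hyperparameters, "training_files_seen": n_files}
    (model_dir / "model.json").write_text(json.dumps(artifact))
    return artifact
```

The container's training entry point would call `run_training()` with the default prefix. Passing a different prefix allows the script to be tried on a local copy of the directory layout before the image is built.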
Video: SageMaker – Bring your Own Container
A 15.16 minute video from AWS.
When to use bring your own containers
Bring your own containers can be used to do things that SageMaker does not provide:
- Access to the latest versions of frameworks not yet available in SageMaker
- Custom code or algorithms
- Access to datalakes via APIs
- Access to backend runtime non-Python kernels, for example R, Julia
P3 instances have large numbers of GPUs for fast processing of large workloads. The GPUs are supported by large bandwidth networking infrastructure for high throughput. The effect of this engineering is to reduce training times from days to minutes and can lead to cost savings.
Video: Introduction to Amazon EC2 P3 Instances
A 2.18 minute video from AWS.
Components of an ML training job for Deep Learning
These are the components of a Machine Learning training job for Deep Learning:
- Define an IAM Role
- SageMaker Notebook, this is a Jupyter notebook
- Framework: e.g. TensorFlow
- Interface: e.g. Keras. Keras is an open source library that provides a Python interface for artificial neural networks
- Keras optimizers:
  - Adam – Adaptive moment estimation
  - SGD – Stochastic gradient descent
- Neural Network parameters, the size and shape of the neural network
- Dataset locations in S3:
  - Training dataset
  - Validation dataset
  - Evaluation dataset
- Output location for validation report
- Location to save the trained Model
Video: Machine Learning Models with TensorFlow Using Amazon SageMaker – AWS Online Tech Talks
A 40.15 minute video from AWS.
This study guide has covered the training phase of Machine Learning. Data is formatted to help the algorithm process it quickly. The data is split into training and testing portions so that a Model can be tested on data it has not been trained on. The internal processing of the algorithm was explained in Model training concepts. In the last section the infrastructure options for running training jobs were described.