Evaluating a Machine Learning model is the last stage before deploying it to production. We evaluate models to confirm that they perform as expected and are good enough for the task they were created for. Evaluation is performed after model training has finished, and different techniques are used depending on the type of problem and the type of algorithm. Most evaluation techniques compare the model's predictions against test data that was split from the original training data. This only works if both the training data as a whole and the test data are representative of the real-world data.
This Study Guide covers subdomain 3.5 Evaluate machine learning models of the AWS exam. More information on the syllabus can be found in: AWS Machine Learning exam syllabus
Why evaluate models
Once a model has been trained, you have to evaluate it to check whether it will make good predictions on new, unseen data. Evaluation is performed using data that was split from the initial training data and was not used in training.
Performance metrics depend on:
- Business need
- The problem the model is trying to solve, such as:
  - Classification problems
  - Regression problems
  - Clustering problems
Video: AWS re:Invent 2020: Detect machine learning (ML) model drift in production
A 29.50 minute video by Sireesha Muppala from AWS.
A portion of the training data is set aside for evaluation. This is data that the model has not been trained or validated on.
- Validation data is used during training (for tuning)
- Test data is for evaluation
Shuffling, or randomizing, the data before splitting helps prevent order-related bias in the splits, which can otherwise produce misleading evaluation results and contribute to overfitting.
When splitting the test and validation datasets from the training data, we assume that each dataset is representative of the real-world data. In other words, the algorithm expects to see the same distribution of features in the test dataset as in the training dataset. To make sure this happens, the data must be shuffled before splitting, which avoids any visible or invisible bias created by the order of the data. Visible bias comes from features you know are being used to produce predictions. Invisible bias comes from relationships between those features and other features, so it may not be revealed by visualizing the data. Shuffling can remove both kinds of bias, and can be performed before splitting the data and again before each training epoch.
- Underfitting and Overfitting in Machine Learning
- Overfitting and Underfitting With Machine Learning Algorithms
- Overfit and underfit
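The shuffle-then-split advice above can be sketched in plain Python. This is a minimal illustration with a hypothetical toy dataset whose rows arrive sorted by label, exactly the situation where splitting without shuffling would bias the test set:

```python
import random

def shuffle_and_split(rows, test_frac=0.2, seed=42):
    """Shuffle the dataset, then split off a held-out test set."""
    shuffled = rows[:]                     # copy so the original order survives
    random.Random(seed).shuffle(shuffled)  # fixed seed -> repeatable split
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]  # (train, test)

# Toy dataset sorted by label: 80 negatives followed by 20 positives.
data = [(i, 0) for i in range(80)] + [(i, 1) for i in range(80, 100)]
train, test = shuffle_and_split(data)
```

Without the shuffle, the last 20 rows (all positives) would become the test set, and the model would be trained on negatives only.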
Bias and variance
- Medium: The Bias-Variance Tradeoff
- Bias: Assumptions made by the model to make predictions easier.
- Variance: How prediction can change with different training data.
Bias can be evaluated during post training bias analysis. Bias can be introduced via:
- Training data
- The algorithm
Bias can be regarded as an inaccuracy or underperformance of the model. However, if the business purpose of the model is to decide which members of the public can buy your goods and services, bias becomes a question of fairness. In some countries this can have legal ramifications.
The prediction error in supervised Machine Learning comprises:
- Bias error
- Variance error
- Irreducible error
Bias errors are caused by the algorithm's innate simplifying assumptions, which are used to make learning from the training data easier. Some algorithms are more prone to bias than others, for example:
- Low bias: k-Nearest Neighbor
- High bias: Linear Learner
Variance error is how much a prediction changes depending on the training data used.
- Low variance: Changing the training data does not change predictions much.
- High variance: Changing the training data can produce significantly different predictions.
Examples of variance are:
- Low variance: Linear Learner
- High variance: k-Nearest Neighbor
An ideal algorithm will have low bias and low variance. However, there is often a tradeoff between them, as this table shows.
|Algorithm type|Bias|Variance|
|---|---|---|
|Linear ML algorithms (e.g. Linear Learner)|High|Low|
|Non-linear ML algorithms (e.g. k-Nearest Neighbor)|Low|High|
Bias and variance are linked: decreasing bias tends to increase variance, and decreasing variance tends to increase bias.
- AWS docs: Detect Posttraining Data and Model Bias – Amazon SageMaker
- AWS docs: Detect Pretraining Data Bias – Amazon SageMaker
- Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning
How to evaluate models
The metrics used to evaluate a model depend on the algorithm used and the type of Machine Learning problem.
- AWS docs: Monitor and Analyze Training Jobs Using Metrics – Amazon SageMaker
- Google docs: Evaluating models
- Various ways to evaluate a machine learning models performance
- How to Evaluate the Performance of Your Machine Learning Model
Video: Machine Learning Model Evaluation Metrics
A 34.02 minute video by Maria Khalusova (JetBrains) via Anaconda.com. The recording level is quite low, so you will have to increase the volume. This video covers evaluation of Classification and Regression problems.
The Confusion Matrix identifies when classes get confused with each other. Rows are the actual values and columns are the values predicted by the model.
|Actual \ Predicted|Class 1|Class 2|
|---|---|---|
|Class 1|True Positives|False Negatives|
|Class 2|False Positives|True Negatives|
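The four confusion-matrix cells can be tallied directly from lists of actual and predicted labels. This is a minimal sketch with made-up labels for a binary problem:

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally the four confusion-matrix cells for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # correctly predicted positive
        elif a != positive and p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive and p != positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # correctly predicted negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
```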
This subject is also covered in Data visualization for Machine Learning
Video: Machine Learning Fundamentals: The Confusion Matrix
A 7.12 minute video by Josh Starmer of Statquest.
Accuracy is a comparison of correct predictions to the total predictions, expressed as a percentage. This can be a poor indicator of model performance. For example, suppose you are trying to detect fraud that occurs in only one of every 1,000 transactions. A model that simply predicts "no fraud" every time would be 999 / 1000 = 0.999, or 99.9%, accurate, yet completely useless.
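The fraud example can be reproduced in a few lines; the data here is synthetic:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# 1 fraudulent transaction in 1000; a model that always predicts
# "no fraud" is still 99.9% accurate -- and completely useless.
actual = [1] + [0] * 999
always_no_fraud = [0] * 1000
acc = accuracy(actual, always_no_fraud)   # 0.999
```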
Precision is the proportion of positive predictions that are true positives, expressed as a percentage. It shows how often the model is right when it predicts a positive. If a model's precision is high, its positive predictions are almost always correct. If many of its positive predictions turn out to be incorrect (false positives), the model has low precision; a model with 50% precision is wrong half of the time when it predicts a positive.
Precision = True Positives / (True Positives + False Positives)
Recall is also known as Sensitivity or the true positive rate. A false negative is an actual positive that the model incorrectly predicted as negative. So Recall tells us what proportion of the actual positives the model found, and by extension how many it missed.
Recall = True Positives / (True Positives + False Negatives)
F1 score combines Precision and Recall into a single score from zero to one. A high score means the model is good at predicting positives and rarely misclassifies actual positives as negatives; a score of one indicates perfect precision and recall. The Macro F1 score is the average of the per-class F1 scores in a multi-class problem.
F1 = 2 x Precision x Recall / (Precision + Recall)
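The three formulas above can be checked with a short script; the confusion-matrix counts are made up for illustration:

```python
def precision(tp, fp):
    """Of all positive predictions, how many were right."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, how many the model found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Example: 8 true positives, 2 false positives, 4 false negatives
p = precision(8, 2)   # 0.8 -- right 80% of the time when predicting positive
r = recall(8, 4)      # ~0.667 -- found two thirds of the actual positives
f1 = f1_score(p, r)
```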
Machine Learning: Testing and Error Metrics
This is a 44.42 minute video by Luis Serrano. Contents: Training, Testing – Evaluation Metrics: Accuracy, Precision, Recall, F1 Score – Types of Errors: Overfitting and Underfitting – Cross Validation and K-fold Cross Validation – Model Evaluation Graphs – Grid Search
Area Under the Curve (AUC)
Area under the curve is a statistical measure of the area under a graphed line, or curve. PR-AUC plots Precision against Recall; the higher the area, the better the model.
Receiver Operating Characteristic (ROC)
This is another type of Area Under the Curve metric. The Receiver Operating Characteristic curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
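Both PR-AUC and ROC-AUC come down to computing the area under a set of points. A minimal sketch using the trapezoidal rule, with hypothetical ROC points (the FPR/TPR values are made up):

```python
def auc(xs, ys):
    """Area under a curve given points sorted by x (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        area += (x1 - x0) * (y0 + y1) / 2   # area of one trapezoid
    return area

# Hypothetical ROC points: (FPR, TPR) pairs from (0, 0) to (1, 1)
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.6, 0.9, 1.0]
roc_auc = auc(fpr, tpr)   # 0.5 would be random guessing; 1.0 is perfect
```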
Root Mean Squared Error (RMSE)
This compares the values predicted by a model with the actual observed values. The lower the value, the better the fit of the model's predictions.
R squared describes how well a regression model fits real-world data. The value ranges from zero to one, with a higher value indicating a better fit.
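Both regression metrics follow directly from their definitions. A minimal sketch with made-up actual and predicted values:

```python
def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

def r_squared(actual, predicted):
    """Coefficient of determination: closer to 1 is better."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.2, 8.9]
error = rmse(actual, predicted)       # small error -> good fit
fit = r_squared(actual, predicted)    # close to 1 -> good fit
```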
Deep Learning models are stochastic: they use randomness, for example when shuffling data before each training epoch. So each time a model is trained, the resulting skill may change. One way to make the randomness repeatable across multiple training runs is to fix the seed of the random number generator. This allows a training run to be repeated even though it uses randomness. However, a more robust approach is to repeat the training and evaluation of a model multiple times and produce an average, or grand mean, of the model's skill. Statistical methods can then be employed to compare the predictions with the test data. For example:
Standard error = standard deviation / square root (count(scores))
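The standard-error formula can be computed with the standard library; the scores here are hypothetical accuracy results from five repeated train/evaluate runs:

```python
import statistics

def standard_error(scores):
    """Standard error of the mean skill across repeated runs."""
    return statistics.stdev(scores) / len(scores) ** 0.5

# Hypothetical accuracy scores from five repeated training runs
scores = [0.82, 0.85, 0.79, 0.84, 0.80]
mean_skill = statistics.mean(scores)   # the grand mean of the model's skill
se = standard_error(scores)            # uncertainty of that mean
```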
One method to generate a population of outcomes is to use K-Fold Cross Validation. With this method the data is split into k folds; the model is trained on k − 1 of them and evaluated on the reserved, or held-out, fold, repeating until each fold has served once as the test set. The multiple results can be averaged or analyzed using other statistical methods to determine the performance of the model.
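A minimal sketch of generating the k train/test index splits in plain Python (index generation only, not a full cross-validation loop):

```python
def k_fold_indices(n, k):
    """Split range(n) into k folds; yield (train_idx, test_idx) per fold."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        # the last fold absorbs any remainder when n is not divisible by k
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test_idx = indices[start:stop]              # held-out fold
        train_idx = indices[:start] + indices[stop:]  # the other k-1 folds
        yield train_idx, test_idx

folds = list(k_fold_indices(10, 5))   # 5 folds, 2 test indices each
```

In practice the data should be shuffled before generating the folds, for the same reasons discussed earlier.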
In this Study Guide we have discussed why we evaluate Machine Learning models and the techniques for evaluation. A model is evaluated to confirm that it is making good predictions. We aim to minimize both bias and variance. The evaluation techniques depend on the type of problem the model was trained to solve and the type of algorithm. They are a comparison of the model predictions with the true outcomes in the test data.
Contains affiliate links. If you go to Whizlab’s website and make a purchase I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.
Whizlabs AWS Certified Machine Learning Specialty
Whizlab’s AWS Certified Machine Learning Specialty Practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined by official documentation. These practice tests are provided to the candidates to gain more confidence in exam preparation and self-evaluate them against the exam content.
Practice test content
- Free Practice test – 15 questions
- Practice test 1 – 65 questions
- Practice test 2 – 65 questions
- Practice test 3 – 65 questions
Section test content
- Core ML Concepts – 10 questions
- Data Engineering – 11 questions
- Exploratory Data Analysis – 13 questions
- Modeling – 15 questions
- Machine Learning Implementation and Operations – 12 questions
Questions and answers
Whizlab’s AWS Certified Machine Learning Specialty course
- In Whizlabs AWS Machine Learning certification course, you will learn and master how to build, train, tune, and deploy Machine Learning (ML) models on the AWS platform.
- Whizlab’s Certified AWS Machine Learning Specialty practice tests offer you a total of 200+ unique questions to get a complete idea about the real AWS Machine Learning exam.
- Also, you get access to hands-on labs in this course. There are about 10 lab sessions that are designed to take your practical skills on AWS Machine Learning to the next level.
- 9 Practice tests with 271 questions
- Video course with 65 videos
- 9 hands-on labs