How to evaluate Machine Learning models

Evaluating Machine Learning models is the last stage before deploying a model to production. We evaluate Machine Learning models to confirm that they are performing as expected and that they are good enough for the task they were created for. The evaluation stage is performed after model training is finished. Different techniques are used depending on the type of problem and type of algorithm. Most evaluation techniques rely on comparing the model’s predictions with the true outcomes in test data that was split from the original training data. This only works if both the training data as a whole and the test data are representative of real world data.

This Study Guide covers subdomain 3.5 Evaluate machine learning models of the AWS exam. More information on the syllabus can be found in: AWS Machine Learning exam syllabus

Why evaluate models

Once a model has been trained you have to evaluate it to check if it will make good predictions on new unseen data. Evaluation is performed using data that was split from the initial training data and was not used in training.

Performance metrics depend on:

  1. Business need
  2. The problem the model is trying to solve, such as:
    1. Classification problems
    2. Regression problems
    3. Clustering problems

Video: AWS re:Invent 2020: Detect machine learning (ML) model drift in production

A 29.50 minute video by Sireesha Muppala from AWS

Test data

A portion of the training data is set aside for evaluation. This is data that the model has not been trained or validated on. 

  • Validation data is used during training, for example to tune the model
  • Test data is used for the final evaluation

Shuffling data

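Shuffling, or randomizing, the data before it is split helps prevent the model from learning patterns that come only from the order of the data. This reduces the risk of underfitting and overfitting.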

When splitting the test and validation datasets from the training data we assume that the data in each dataset is representative of real world data. This means that the model is expected to see the same distribution of features in the test dataset as in the training dataset. To make sure this happens the data must be shuffled before splitting. This avoids any visible or invisible bias created by the order of the data. Visible bias is caused by features you know are being optimized to produce predictions. Invisible bias is the relationship between these features and other features, so it may not be revealed by visualizing the data. Shuffling can remove this bias. Shuffling can be performed before splitting the data and before each training epoch.
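
Shuffling before splitting can be sketched in a few lines of plain Python. This is only an illustration: the function name shuffle_and_split, the split fractions and the sample rows are assumptions made for this example, not part of any AWS service.

```python
import random


def shuffle_and_split(rows, test_fraction=0.2, validation_fraction=0.1, seed=42):
    """Shuffle rows, then split them into training, validation and test lists."""
    shuffled = list(rows)       # copy so the caller's data is left untouched
    rng = random.Random(seed)   # a fixed seed makes the shuffle repeatable
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Hypothetical data sorted by label: without shuffling, the first 20 rows
# (all label 0) would become the test set and misrepresent the data.
rows = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50, 100)]
train, validation, test = shuffle_and_split(rows)
print(len(train), len(validation), len(test))  # 70 10 20
```

Because the rows are shuffled first, each of the three datasets should contain a mix of both labels rather than a block taken from one end of the sorted data.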

Bias and variance

  • Bias: Simplifying assumptions made by the model to make the target easier to learn.
  • Variance: How much predictions change when different training data is used.

Bias can be evaluated during post training bias analysis. Bias can be introduced via:

  • Training data
  • The algorithm

Bias can be regarded as an inaccuracy or underperformance of the model. However, if the business purpose of the model is to decide which members of the public can buy your goods and services, then it becomes a question of fairness. In some countries this can have legal ramifications.

The prediction error in supervised Machine Learning comprises:

  1. Bias error
  2. Variance error
  3. Irreducible error
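
For squared error, these three components are commonly written as the decomposition below. This is the standard textbook form rather than something specific to this guide, and the symbols are assumptions introduced here: y is the true value, \hat{f}(x) is the model's prediction and \sigma^2 is the variance of the noise in the data.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias error}}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance error}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Here f(x) is the true underlying function, and the expectation is taken over different training datasets and the noise.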

Bias errors are caused by the algorithm’s innate simplifying assumptions. These are used to make learning from the training data easier. Some algorithms are more prone to bias than others, for example:

  • Low bias: K-Nearest Neighbor
  • High bias: Linear Learner

Variance error is how a prediction will change depending on the training data used.

  • Low variance: Changing training data does not change predictions that much.
  • High variance: Changing training data can produce significantly different predictions.

Examples of variance are:

  • Low variance: Linear Learner
  • High variance: K-Nearest Neighbor

An ideal algorithm will have low bias and low variance. However there is often a tradeoff between them, as this table shows.

Algorithm type           Bias    Variance
Linear ML algorithms     high    low
Non-linear algorithms    low     high

Bias and variance are linked. Increasing bias will decrease variance. Increasing variance will decrease bias.

How to evaluate models

The metrics used to evaluate a model depend on the algorithm used and the type of Machine Learning problem.

Video: Machine Learning Model Evaluation Metrics

A 34.02 minute video by Maria Khalusova (JetBrains). The recording level is quite low, so you will have to increase the volume. This video covers evaluation of Classification and Regression problems.

Classification problems

Confusion matrix

The Confusion Matrix identifies when classes get confused with each other. Rows are the actual values and columns are the values predicted by the model. In this table Class 1 is the positive class.

                  Predicted Class 1    Predicted Class 2
Actual Class 1    True Positives       False Negatives
Actual Class 2    False Positives      True Negatives
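
The four cells of a two-class confusion matrix can be counted directly from the actual and predicted labels. The sketch below is a plain Python illustration; the function name confusion_counts and the sample labels are made up for this example, and it assumes that label 1 is the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical labels for eight samples.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```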

This subject is also covered in Data visualization for Machine Learning

Video: Machine Learning Fundamentals: The Confusion Matrix

A 7.12 minute video by Josh Starmer of StatQuest.


Accuracy

Accuracy is the number of correct predictions compared to the total number of predictions, expressed as a percentage. This can be a poor indicator of model performance when the classes are imbalanced. For example, suppose you are trying to detect fraud, which only happens once in 1000 transactions. A model that simply predicts there is never any fraud would be 999 / 1000 = 0.999 or 99.9% accurate, yet it would not detect a single fraudulent transaction.
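
The fraud example can be reproduced in a few lines of Python. This is a minimal sketch in which label 1 is assumed to mean fraud and label 0 no fraud; the function name accuracy is chosen for this illustration.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# One fraudulent transaction (1) in 1000, and a "model" that always says no fraud (0).
actual = [1] + [0] * 999
always_no_fraud = [0] * 1000

print(accuracy(actual, always_no_fraud))  # 0.999

# The same model finds none of the fraudulent transactions.
frauds_found = sum(1 for a, p in zip(actual, always_no_fraud) if a == 1 and p == 1)
print(frauds_found)  # 0
```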


Precision

Precision is the proportion of positive predictions that are true positives, expressed as a percentage. It shows how often the model is right when it predicts a positive. So if a model’s precision is high then when it predicts a positive outcome it will almost always be right. If, however, many of a model’s positive predictions turn out to be incorrect (false positives) the model has low precision, for example 50%. A model with 50% precision will produce positive predictions that are wrong half of the time.

Precision = True Positives / (True Positives + False Positives)


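Recall

Recall is also known as Sensitivity or the true positive rate. It is the proportion of actual positives that the model correctly predicted. A false negative is an actual positive that the model incorrectly predicted as negative. So Recall tells us how many of the actual positives the model missed.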

Recall = True Positives / (True Positives + False Negatives)
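
Both formulas can be checked with a small worked example. The counts below are made up for illustration, and returning 0.0 when the denominator is zero is a convention assumed for this sketch rather than a fixed rule.

```python
def precision(true_positives, false_positives):
    """True Positives / (True Positives + False Positives)."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        return 0.0   # assumed convention: no positive predictions were made
    return true_positives / predicted_positives


def recall(true_positives, false_negatives):
    """True Positives / (True Positives + False Negatives)."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0   # assumed convention: there were no actual positives
    return true_positives / actual_positives


# Hypothetical counts: 30 true positives, 30 false positives, 10 false negatives.
print(precision(30, 30))  # 0.5  -> half of the positive predictions were wrong
print(recall(30, 10))     # 0.75 -> a quarter of the actual positives were missed
```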

F1 score

F1 score is a combination of Precision and Recall (their harmonic mean), with scores from zero to one. A high score means that a model’s positive predictions are usually correct and that it misses few of the actual positives. A score of one indicates perfect precision and recall. The Macro F1 score is the unweighted average of the F1 scores calculated separately for each class.

F1 = 2 x (Precision x Recall) / (Precision + Recall)
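
As a quick check of the formula, note that the sum of Precision and Recall forms the whole denominator. The sketch below uses made-up precision and recall values, and returning 0.0 when both are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2 x (Precision x Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0   # assumed convention when both values are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical model with 50% precision and 75% recall.
print(round(f1_score(0.5, 0.75), 3))  # 0.6

# Perfect precision and recall give the maximum score of one.
print(f1_score(1.0, 1.0))  # 1.0
```

The F1 score is pulled towards the lower of the two values, so a model cannot score well by being strong on only one of them.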

Error metrics

Video: Machine Learning: Testing and Error Metrics

This is a 44.42 minute video by Luis Serrano. Contents: Training, Testing – Evaluation Metrics: Accuracy, Precision, Recall, F1 Score – Types of Errors: Overfitting and Underfitting – Cross Validation and K-fold Cross Validation – Model Evaluation Graphs – Grid Search

Area Under the Curve (AUC)

Area under the curve is a statistical measurement of the area under a graphical line, or curve. The PR-AUC is the area under a graph of Precision plotted against Recall at different classification thresholds. The higher the value the better the model.

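Receiver Operating Characteristic (ROC) curve

The area under the ROC curve is another type of Area Under the Curve metric. The Receiver Operating Characteristic curve is a graph of the True Positive Rate (TPR) plotted against the False Positive Rate (FPR) at different classification thresholds.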
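
The ROC curve and the area under it can be sketched in plain Python. This is an illustration under stated assumptions: labels are 0 or 1, both classes are present, a higher score means the model thinks a positive is more likely, and the area is estimated with the trapezoid rule. The function names and sample scores are made up for this example.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points, one for each distinct score used as a threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid rule over points that are ordered by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and model scores for four samples.
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(actual, scores)
print(points)                    # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(points))  # 0.75
```

An area of 1.0 means the scores separate the two classes perfectly, while 0.5 is what random scores would be expected to give.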

Regression problems

Root Mean Squared Error (RMSE)

This is a comparison between the values predicted by a model and the actual observed values. It is the square root of the average squared difference between them. The lower the value the better the fit of the model’s predictions.
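
Root Mean Squared Error can be calculated as in the sketch below, which squares each difference between an actual and a predicted value, averages the squares and takes the square root. The sample values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed values and model predictions.
actual = [3, 5, 7, 9]
predicted = [1, 5, 7, 11]
print(round(rmse(actual, predicted), 3))  # 1.414

# Perfect predictions give an error of zero.
print(rmse(actual, actual))  # 0.0
```

The result is in the same units as the values being predicted, which makes it easier to interpret than the mean squared error.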

R squared

R squared describes how well a regression model fits real world data. The value typically ranges from zero to 1, with a higher value showing a better fit.
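
One common way to calculate R squared is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below assumes that definition and uses made-up values.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_ss / total_ss


# Hypothetical observed values and model predictions.
actual = [3, 5, 7, 9]
predicted = [1, 5, 7, 11]
print(round(r_squared(actual, predicted), 3))  # 0.6

print(r_squared(actual, actual))        # 1.0, a perfect fit
print(r_squared(actual, [6, 6, 6, 6]))  # 0.0, no better than predicting the mean
```

With this definition a model that predicts worse than the mean produces a negative value.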

Neural networks

Deep Learning models are stochastic and use randomness, for example to shuffle the data for each training epoch. So each time a model is trained the skill result may change. One way to make the randomness repeatable across multiple training runs of the model is to fix the seed of the random number generator. This allows a training run to be repeatable even though it uses randomness. However a more robust approach is to repeat the training of a model and the comparison with test data multiple times to produce an average, or grand mean, of the model’s skill. Statistical methods can then be employed to describe how reliable that average is. For example:

Standard error = standard deviation / sqrt(count(scores))
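
The standard error of the mean is the standard deviation of the scores divided by the square root of the number of scores. The sketch below uses the sample standard deviation from Python's statistics module; the five scores are made up and stand in for five repeated training runs.

```python
import math
import statistics


def standard_error(scores):
    """Sample standard deviation divided by the square root of the number of scores."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


# Hypothetical accuracy scores from five repeated training runs.
scores = [0.81, 0.79, 0.84, 0.80, 0.76]
print(round(statistics.mean(scores), 3))  # 0.8
print(round(standard_error(scores), 4))   # 0.013
```

More repeated runs shrink the standard error, so the average becomes a more trustworthy estimate of the model's skill.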

One method to generate a population of outcomes is to use K-Fold Cross Validation. With this method the data is split into k folds. The model is trained on k-1 folds and evaluated on the reserved, or held out, fold. This is repeated k times so that each fold is held out once. The multiple results can be averaged or analyzed using other statistical methods to determine the performance of the model.
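
The way K-Fold Cross Validation divides the data can be sketched as a generator of index lists, where each fold is held out once and the remaining folds are used for training. The function name k_fold_indices is an assumption for this illustration, and the sketch expects the data to have been shuffled already.

```python
def k_fold_indices(n_samples, k):
    """Yield (training indices, held out indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, held_out
        start += size


# Ten samples split into five folds: train on eight samples, evaluate on two, five times.
for training, held_out in k_fold_indices(10, 5):
    print(held_out, training)
```

Training and scoring a model inside the loop would produce one score for each fold, which can then be averaged.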


Summary

In this Study Guide we have discussed why we evaluate Machine Learning models and the techniques for evaluation. A model is evaluated to confirm that it is making good predictions. We aim to minimize both bias and variance. The evaluation techniques depend on the type of problem the model was trained to solve and the type of algorithm. They are a comparison of the model predictions with the true outcomes in the test data.


Contains affiliate links. If you go to the Whizlabs website and make a purchase I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.

Whizlabs AWS Certified Machine Learning Specialty

Practice Exams with 271 questions, Video Lectures and Hands-on Labs from Whizlabs

The Whizlabs AWS Certified Machine Learning Specialty practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined by the official documentation. These practice tests help candidates gain confidence in their exam preparation and evaluate themselves against the exam content.

Practice test content

  • Free Practice test – 15 questions
  • Practice test 1 – 65 questions
  • Practice test 2 – 65 questions
  • Practice test 3 – 65 questions

Section test content

  • Core ML Concepts – 10 questions
  • Data Engineering – 11 questions
  • Exploratory Data Analysis – 13 questions
  • Modeling – 15 questions
  • Machine Learning Implementation and Operations – 12 questions

Questions and answers

By Michael Stainsbury

3.5 How to evaluate Machine Learning models

Five questions from a test bank of 10 questions about subdomain 3.5 Evaluate machine learning models of the Modeling knowledge domain.

  1. What is test data used for?
  2. <–?–> tells us how many positive predictions the model missed.
  3. <–?–> is calculated as a comparison of true positives out of all positive predictions as a percentage.
  4. What do the metrics used to evaluate a model depend on?
  5. <–?–> is a portion of the training data that is set aside for evaluation. This is data that the model has not been trained or validated on.


Whizlabs AWS Certified Machine Learning Specialty course

  • In the Whizlabs AWS Machine Learning certification course, you will learn and master how to build, train, tune, and deploy Machine Learning (ML) models on the AWS platform.
  • The Whizlabs Certified AWS Machine Learning Specialty practice tests offer you a total of 200+ unique questions to get a complete idea about the real AWS Machine Learning exam.
  • You also get access to hands-on labs in this course. There are about 10 lab sessions that are designed to take your practical skills on AWS Machine Learning to the next level.

Course content

The course has 3 resources which can be purchased separately, or together:

  • 9 Practice tests with 271 questions
  • Video course with 65 videos
  • 9 hands on labs
