Two gloved hands holding an antibacterial hand sanitizer gel dispenser, symbolising data cleansing

Data cleansing and preparation for modeling

Understanding data, cleansing data and dataset generation are important first steps in exploratory data analysis. Every other phase in the Machine Learning process relies on the data being cleaned and prepared. This Study Guide starts with statistical techniques used to help understand the data. Once data is understood, it has to be cleaned up so that Machine Learning algorithms can operate effectively. Sometimes the quantity of data for training is insufficient or the production data needs to be enhanced. In these cases, services and techniques for dataset generation and data augmentation are used.

Sanitize and prepare data for modeling is sub-domain 2.1 of the Exploratory Data Analysis knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus


To confirm your understanding, scroll to the bottom of the page for the test app.

Understand your data

Before you can sanitize and prepare your data you must first understand your data. The data can be described using Descriptive Statistics. These statistics are solely concerned with describing the data we have. This type of statistics does not assume the data comes from a larger pool or population of data.

The statistics fall into three groups:

  • Overall statistics
  • Multivariate statistics
  • Attribute statistics

Overall statistics

Overall statistics give you an idea of the size of what you are dealing with. They are often quick and simple to collect. They include:

  • Total number of records, or rows.
  • Number of columns
  • Storage size in bytes
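As a rough illustration, the three overall statistics listed above can be gathered with Python's standard library. The CSV text and its column names are made-up sample data, assumed only for this sketch.

```python
import csv
import io

# Hypothetical sample data; in practice this would come from a file or object store.
CSV_TEXT = "age,height_cm,city\n34,170,Leeds\n51,182,York\n29,165,Leeds\n"


def overall_statistics(csv_text):
    """Return the row count, column count and size in bytes of some CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    return {
        "rows": len(rows),
        "columns": len(reader.fieldnames or []),
        "bytes": len(csv_text.encode("utf-8")),
    }


stats = overall_statistics(CSV_TEXT)
print(stats)
```

The header line is not counted as a record, and the byte count assumes UTF-8 encoding.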

Attribute statistics

Attribute statistics allow numeric fields to be described:

  • mean (average)
  • mode
  • median
  • variance
  • minimum value
  • maximum value
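The attribute statistics listed above map closely onto Python's built-in `statistics` module. This is a minimal sketch; the `ages` values are invented for illustration.

```python
import statistics

# Hypothetical values for one numeric attribute, for example customer age.
ages = [23, 35, 35, 41, 52, 35, 60]

attribute_stats = {
    "mean": statistics.mean(ages),
    "mode": statistics.mode(ages),
    "median": statistics.median(ages),
    # pvariance treats the values as the complete data set, which matches
    # descriptive statistics; statistics.variance would treat them as a sample.
    "variance": statistics.pvariance(ages),
    "minimum": min(ages),
    "maximum": max(ages),
}

for name, value in attribute_stats.items():
    print(f"{name}: {value}")
```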

Multivariate statistics

Multivariate statistics can show the relationship between two attributes. A high correlation between two attributes can lead to poor model performance. Once identified, a decision can be made to exclude one of the attributes, or to combine them into a single attribute. Scatter plots can be used to spot the relationship between two numerical attributes, and a grid of scatter plots extends this to more than two. This helps to spot special relationships. Another technique is the correlation matrix, which quantifies the linear relationship between each pair of variables: a value close to 1 or -1 is a strong relationship and 0 is no linear relationship.
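To make the correlation matrix concrete, here is a small sketch that computes the Pearson correlation coefficient by hand and arranges the results as a matrix. The attribute names and values are hypothetical, and libraries such as pandas offer the same calculation ready-made.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {
        a: {b: pearson_correlation(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical attributes: the two height columns carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "visits": [3, 1, 4, 1, 5],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this made-up data the two height columns are almost perfectly correlated, so one of them would be a candidate for removal.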

Data cleansing


Data quality criteria

Before data can be cleaned you need to be able to identify whether it needs to be cleaned. This can be done by examining the Data Quality criteria:

  • Validity
  • Accuracy
  • Completeness
  • Consistency
  • Uniformity


Validity

It is possible for data to be correctly formatted but invalid. For example, 1900-01-01 is a correctly formatted date. However, if this was recorded as the date a document was received it is likely to be incorrect, unless you are conducting historical research into the beginning of the twentieth century.
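A validity rule like the one above can be sketched as a range check on top of a format check. The function name and the default earliest date are assumptions chosen for illustration; a real rule would come from the business context.

```python
from datetime import date


def is_valid_received_date(text, earliest=date(2000, 1, 1), latest=None):
    """Check that text is an ISO date (YYYY-MM-DD) inside the allowed range."""
    latest = latest or date.today()
    try:
        value = date.fromisoformat(text)
    except ValueError:
        return False  # not a well-formed date at all
    return earliest <= value <= latest


for candidate in ["1900-01-01", "2020-06-15", "not a date"]:
    print(candidate, is_valid_received_date(candidate))
```

The date 1900-01-01 parses without error, so it is the range check, not the format check, that rejects it.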


Accuracy

To determine if data is accurate you need an external data source to compare it to. For example, to know if an address reference code (zip code) is correct for an address, the data would need to be compared to a database of addresses and codes.


Completeness

Data may have missing values. If the data was derived from another source you may be able to go back and try to get the data you need, or at least understand why it was missing. Otherwise you can either leave it blank or replace it with a value derived from statistical analysis of all the other values in this field. See the section on missing values in this Study Guide.


Consistency

Consistency is a measure of whether the same data collected by different systems agrees. For example, a bank will collect a person's age across many different financial products and situations. If the person's age is the same in all systems then it is consistent. However, a person may give a different age when talking to someone in a call centre than when buying a pension, where the retirement age is important to them.


Uniformity

Different units of measure are used in different countries. Dates also have different formats and may even refer to different calendars. When data is brought together, it must all conform to a common unit of measure and format to be processed together.
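As one possible illustration of bringing dates into a uniform format, the sketch below tries a short list of known source formats and emits ISO dates. The list of formats is an assumption about hypothetical source systems. Ambiguous formats such as day/month versus month/day cannot be detected automatically and have to be decided per source.

```python
from datetime import datetime

# Assumed formats used by the (hypothetical) source systems, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(text):
    """Convert a date string in any known format to YYYY-MM-DD."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognised date format: {text!r}")


raw_dates = ["2021-12-31", "31/12/2021", " 20211231 "]
uniform_dates = [to_iso_date(d) for d in raw_dates]
print(uniform_dates)
```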

Data cleansing techniques


Irrelevant data

By understanding your data and framing the business problem you can identify fields and rows that cannot contribute to the final answer. So if you are searching for ideal colour combinations in women's winter wear, you can probably safely drop all records for men.


Repeated data can appear when combining datasets from different systems or from human input. The repeated records or fields may not be completely identical, so you will need to compare records on their important fields or use a fuzzy match.
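A fuzzy match of the kind mentioned above could be sketched with `difflib.SequenceMatcher` from the standard library. The 0.85 similarity threshold and the customer names are assumptions for illustration only. A suitable threshold depends on the data and should be checked against known duplicates.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_probable_duplicate(a, b, threshold=0.85):
    """True when the two strings are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


def drop_fuzzy_duplicates(names, threshold=0.85):
    """Keep the first occurrence of each group of similar names."""
    kept = []
    for name in names:
        if not any(is_probable_duplicate(name, k, threshold) for k in kept):
            kept.append(name)
    return kept


# Hypothetical customer names collected by different systems.
customers = ["Jane Smith", "jane  smith", "Jane Smyth", "Robert Brown"]
unique_customers = drop_fuzzy_duplicates(customers)
print(unique_customers)
```

Each name is compared with every name already kept, which is fine for small data sets but slow for large ones.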

Type conversion

Many data types may appear as text. Often this is not as useful as the original format. For example, a number in a numeric format can be statistically analysed, but this cannot be done if it is a string.
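The following sketch shows one way to convert numeric text into numbers so that it can be analysed. It assumes that a comma is a thousands separator, which is not true in every locale, and the price values are invented.

```python
def to_number(text):
    """Convert text such as ' 1,250.50 ' to a float, or None if not numeric."""
    cleaned = text.strip().replace(",", "")  # assumes ',' separates thousands
    try:
        return float(cleaned)
    except ValueError:
        return None


raw_prices = ["19.99", " 1,250.50 ", "n/a", "42"]
prices = [to_number(p) for p in raw_prices]

# Statistics are only possible once the values are numeric.
numeric_prices = [p for p in prices if p is not None]
average_price = sum(numeric_prices) / len(numeric_prices)
print(prices, average_price)
```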

Syntax errors

Syntax errors often appear in manually entered data sources where there can be some scope to enter correct data in a variety of ways. Validation in the user interface should minimise this.


Ensure all fields have consistent units. This is particularly important with internationally derived data where different measuring systems are used, such as US customary and metric weights and measures.
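Converting every value to one unit can be as simple as a lookup table of conversion factors. This sketch assumes each record carries its unit alongside the value. The records themselves are hypothetical.

```python
# Conversion factors to kilograms (1 lb is exactly 0.45359237 kg, 1 oz is 1/16 lb).
TO_KILOGRAMS = {
    "kg": 1.0,
    "g": 0.001,
    "lb": 0.45359237,
    "oz": 0.028349523125,
}


def to_kilograms(value, unit):
    """Convert a weight in the given unit to kilograms."""
    try:
        factor = TO_KILOGRAMS[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    return value * factor


# Hypothetical records of (weight, unit) from different countries.
records = [(2.0, "kg"), (10.0, "lb"), (500.0, "g")]
weights_kg = [to_kilograms(value, unit) for value, unit in records]
print(weights_kg)
```

Raising an error for an unknown unit is a deliberate choice here, so that an unrecognised unit is noticed rather than silently passed through.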

Missing values

Missing values and outliers can distort results and therefore affect how the Machine Learning model makes predictions. By understanding your data you can find out the reason behind the data's absence. The empty field could be an optional field from customer data input, or it could be a mutually exclusive field paired with another where only one contains data. A field with missing data could be null or have an indicator to show the data is absent, for example:

  • NaN (Not a Number)
  • NA (Not applicable)
  • None
  • ?

Missing data could be ignored, or the whole record could be dropped. However this could distort the results. The alternative is to replace the data. There are four methods that can be employed to replace missing data:

  • Mean – use the average value to replace missing values.
  • Median – use the middle value of all values to replace missing values.
  • Mode – use the most common value to replace missing values.
  • Use a Machine Learning algorithm to predict the missing values depending on other values in the data.
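The first three replacement methods above can be sketched with the standard `statistics` module. The set of missing-value markers and the sample ages are assumptions for illustration. The fourth method, predicting values with a Machine Learning model, is not shown here.

```python
import statistics

# Markers treated as missing, compared in lower case (an assumed convention).
MISSING_MARKERS = {"", "nan", "na", "none", "?"}

STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def is_missing(value):
    """True for None and for any of the known missing-value markers."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def impute(values, strategy="mean"):
    """Replace missing entries with the mean, median or mode of the rest."""
    present = [float(v) for v in values if not is_missing(v)]
    if not present:
        raise ValueError("no observed values to impute from")
    fill = STRATEGIES[strategy](present)
    return [fill if is_missing(v) else float(v) for v in values]


# Hypothetical age column containing several kinds of missing marker.
ages = ["20", "NaN", "40", "?", "40", None, "60", "90"]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

The mean is pulled upwards by the largest value, which is one reason the median is often preferred when outliers are present.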

Inferential statistics make predictions about data based on a sample of training data, which is what a Machine Learning algorithm does. So a trained Machine Learning model can be used to clean and prepare data for later processing by the chosen Machine Learning model.

Cleansing streaming data

Kinesis Data Analytics

Streaming ETL applications allow you to clean the streaming data as it is collected. Kinesis Data Analytics has a SQL interface that enables you to build a complete cleansing app so that all data you pass on is clean data. The advantage of cleaning data at the point of entry is that you do not have to develop large, powerful cleansing apps to trawl through the landed data. Not only do you save on this processing, but the landed data is also ready sooner for further preparation or ingestion into the Machine Learning model.

Kinesis Data Firehose


Kinesis Data Firehose can process streaming data with Lambda functions to clean the data.
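A Firehose transformation Lambda receives a batch of base64-encoded records and returns each one marked as `Ok`, `Dropped` or `ProcessingFailed`. The sketch below follows that record contract. The cleaning rules and the field names `customer_id` and `country` are hypothetical, chosen only to illustrate the idea.

```python
import base64
import json


def _result(record, status, data=None):
    """Build one output record in the shape Firehose expects."""
    return {
        "recordId": record["recordId"],
        "result": status,
        "data": record["data"] if data is None else data,
    }


def lambda_handler(event, context):
    """Clean each record: drop rows without an id, tidy the country code."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            output.append(_result(record, "ProcessingFailed"))
            continue
        if not isinstance(payload, dict):
            output.append(_result(record, "ProcessingFailed"))
            continue
        if not payload.get("customer_id"):  # hypothetical mandatory field
            output.append(_result(record, "Dropped"))
            continue
        payload["country"] = str(payload.get("country", "")).strip().upper()
        cleaned = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(_result(record, "Ok", base64.b64encode(cleaned).decode("utf-8")))
    return {"records": output}


def _encode(obj):
    """Helper for the local example: JSON-encode then base64-encode."""
    return base64.b64encode(json.dumps(obj).encode("utf-8")).decode("utf-8")


# Local example with a made-up event; no AWS services are called.
sample_event = {
    "records": [
        {"recordId": "1", "data": _encode({"customer_id": "c1", "country": " gb "})},
        {"recordId": "2", "data": _encode({"country": "us"})},
        {"recordId": "3", "data": base64.b64encode(b"not json").decode("utf-8")},
    ]
}
result = lambda_handler(sample_event, None)
print([r["result"] for r in result["records"]])
```

Returning every `recordId` that was received matters, because Firehose matches the output records to the input records by that identifier.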

Dataset generation

Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a service you can use to manually label data. It provides a configurable workflow process to manage the manual labeling of data. The workforce that does the manual labeling can be:

  1. A workforce you provide
  2. A third-party vendor
  3. Amazon Mechanical Turk

Amazon Mechanical Turk

Whilst we aim to automate everything, there are still some tasks that are best done by humans. MTurk is an API and a service from Amazon that acts as an interface between the Requester and people willing to engage in Human Intelligence Tasks (HITs) for payment.

This work is often for repetitive manual tasks that occur infrequently, which would otherwise require a business to take on temporary workers or try to outsource the tasks. With MTurk this is no longer necessary, as the API allows a business to access a worldwide labour force. MTurk also allows a business to engage directly with workers, cutting out the middleman to get the best price. Because this is an on-demand service you only pay for what you use and avoid the responsibilities of being an employer.

Video: Build Highly Accurate Training Datasets Using Amazon SageMaker Ground Truth

This AWS video by Kate Werling is 38 minutes and 39 seconds long, with the first 13 minutes spent introducing Ground Truth, followed by two demos. Here are the timings, in minutes and seconds, so you can select the parts most relevant to your studies:

  • 0 – How can we build Machine Learning Models faster
  • 4.45 – What kind of training data do I need?
  • 7.10 – What is supervised learning?
  • 9.19 – Amazon SageMaker Ground Truth, how it works.
  • 10.18 – Amazon Mechanical Turk mentioned
  • 13.22 – Demo – Mechanical Turk labeling images
  • 21.28 – Demo – Aerial photography
  • 33.10 – AWS Marketplace
  • 34.20 – Amazon SageMaker Endpoints
  • 38.39 – End

Data augmentation

Data augmentation techniques are used to increase the amount of data available for training a Machine Learning model. Original data is changed slightly, or new data is synthesised using existing data as a model. The new data allows a model to be trained on features that the available training data does not contain, but could.

For example, you could train a Machine Learning model to recognise a rare bird for a nature conservancy project. You may only have a few images of the bird, which would not allow the Machine Learning algorithm to be trained effectively. You could create more images with image editing software by changing the size or orientation of the bird, or by synthesising the effects of different weather and lighting. A larger volume of training data gives the Machine Learning algorithm a greater chance of identifying the bird in different circumstances.
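The kind of image changes described above can be sketched without any imaging library by treating a greyscale image as a list of pixel rows. This is only a toy illustration under that assumption. Real projects would normally use an image library, and the tiny `original` image is made up.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*reversed(image))]


def adjust_brightness(image, delta):
    """Lighten or darken every pixel, keeping values within 0 to 255."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]


def augment(image):
    """Return several altered copies of one original image."""
    return [
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90_clockwise(image),
        adjust_brightness(image, 40),   # crude stand-in for brighter lighting
        adjust_brightness(image, -40),  # crude stand-in for duller lighting
    ]


# A tiny hypothetical 2 x 3 greyscale image (0 is black, 255 is white).
original = [
    [0, 50, 100],
    [150, 200, 250],
]
variants = augment(original)
print(len(variants), "new images from 1 original")
```

Each variant keeps the label of the original image, which is what makes augmentation a cheap way to enlarge a labelled training set.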


Descriptive statistics are used to allow you to understand your data. This is the important first step for data cleansing and preparation. Once the data is understood, data cleansing techniques can be used to improve the data quality. The quantity and richness of the data can be improved using Amazon Mechanical Turk, data augmentation and SageMaker Ground Truth.


  • This Study Guide is for the sub-domain 2.1 Sanitize and prepare data for modeling. Data sanitization usually refers to procedures to securely destroy data and media, so I have used the term data cleansing since this is relevant to Machine Learning.
  • The AWS Exam Readiness course mentioned Kinesis Video Streams as Other topics: Data Generation. I have not identified how Kinesis Video Streams can be used for this purpose.

Image by Ri Butov from Pixabay 


Test app questions and answers

By Michael Stainsbury

2.1 Data cleansing and preparation for modeling

Five questions from a test bank of 14 questions for Data cleansing and preparation for modeling, which is part of sub-domain 2.1 of the Exploratory Data Analysis knowledge domain.

  1. What is Data Augmentation?
  2. What do Multivariate statistics show?
  3. What attribute statistics allow numeric fields to be described?
  4. Why are numeric data in text fields converted to numeric data types?
  5. Which data quality criteria could describe a customer age of 175 years?

