Data cleansing and preparation for modeling
Understanding data, cleansing data and dataset generation are important first steps in exploratory data analysis. Every other phase in the Machine Learning process relies on the data being cleaned and prepared. This Study Guide starts with statistical techniques used to help understand the data. Once the data is understood, it has to be cleaned up so that Machine Learning algorithms can operate effectively. Sometimes the quantity of data for training is insufficient, or the production data needs to be enhanced. To do this, services and techniques for dataset generation and data augmentation are used.
Sanitize and prepare data for modeling is sub-domain 2.1 of the Exploratory Data Analysis knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus
Understand your data
Before you can sanitize and prepare your data you must first understand your data. The data can be described using Descriptive Statistics. These statistics are solely concerned with describing the data we have. This type of statistics does not assume the data comes from a larger pool or population of data.
- Descriptive statistics (https://en.wikipedia.org/wiki/Descriptive_statistics)
The statistics fall into three groups:
- Overall statistics
- Attribute statistics
- Multivariate statistics
Overall statistics give you an idea of the size of what you are dealing with. They are often quick and simple to collect. They include:
- Total number of records, or rows.
- Number of columns
- Storage size in bytes
Attribute statistics allow numeric fields to be described:
- mean (average)
- minimum value
- maximum value
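The overall and attribute statistics listed above can be collected with a few lines of code. The following is a minimal sketch using only the Python standard library; the `records` dataset and the function names are hypothetical and chosen for illustration only.

```python
import statistics

# Hypothetical sample dataset: each dict is one record (row).
records = [
    {"age": 34, "income": 42000.0},
    {"age": 45, "income": 58000.0},
    {"age": 29, "income": 39000.0},
    {"age": 52, "income": 75000.0},
]


def overall_statistics(rows):
    """Return the number of rows and the number of distinct columns."""
    columns = {key for row in rows for key in row}
    return {"rows": len(rows), "columns": len(columns)}


def attribute_statistics(rows, field):
    """Return the mean, minimum and maximum of one numeric field."""
    values = [row[field] for row in rows]
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


print(overall_statistics(records))
print(attribute_statistics(records, "age"))
```

On a real dataset the same figures are usually produced by a data analysis library, but the calculation is the same.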
Multivariate statistics can show the relationship between two or more attributes. A high correlation between two attributes can lead to poor model performance, because both attributes carry largely the same information. Once identified, a decision can be made to exclude one of the attributes, or to combine them into a single attribute. Scatter Plots can be used to spot the relationship between two numerical attributes. This helps to spot patterns visually. Another technique is Correlation Matrices, which allow the linear relationship between each pair of variables to be quantified: a value close to 1 or -1 is a strong relationship and a value close to 0 is no linear relationship.
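To make the correlation matrix concrete, here is a minimal sketch that computes the Pearson correlation coefficient for every pair of columns using only the standard library. The column names and values are hypothetical; `height_in` is deliberately derived from `height_cm` so that the two are perfectly correlated.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / math.sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {
        a: {b: pearson_correlation(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical columns: the second is the first converted to inches.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [h / 2.54 for h in [150.0, 160.0, 170.0, 180.0]],
    "discount": [8.0, 6.0, 4.0, 2.0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

A pair such as `height_cm` and `height_in` with a coefficient of 1 is a candidate for dropping one of the two attributes.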
Wikipedia definition: https://en.wikipedia.org/wiki/Data_cleansing
Data quality criteria
Before data can be cleaned you need to be able to identify if it needs to be cleaned. This can be done by examining the Data Quality criteria: validity, accuracy, completeness, consistency and uniformity.
It is possible for data to be well formed, but invalid. For example, 1900-01-01 is a well-formed date. However, if this was recorded as the date a document was received it is likely to be invalid, unless you are conducting historical research of the beginning of the twentieth century.
To determine if data is accurate you have to have an external data source to compare it to. For example, to know if an address reference code (zip code) is correct for an address, the data would need to be compared to a database of addresses and codes.
Data may have missing values. If the data was derived from another source you may be able to go back and try to get the data you need or at least understand why it was missing. Otherwise you can either leave it blank or try to replace it with a value derived from statistical analysis of all the other values in this field. See the section on handling missing values in this Study Guide.
Consistency is a measure of whether the same data collected by different systems agrees. For example, a bank will collect a person's age in many different financial products and situations. If the person's age is the same in all systems in all places then it is consistent. However, a person may give a different age when talking to someone in a call centre than when buying a pension where the retirement age is important to them.
Different units of measure are used in different countries. Dates also have different formats and may even refer to different calendars. When data is brought together it must all conform to a common unit of measure and format to be processed together.
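As an illustration of making date formats uniform, the sketch below converts dates from two hypothetical source systems into ISO 8601 (YYYY-MM-DD). It assumes you already know which format each system uses, because a value such as 01/02/2021 cannot be interpreted reliably without that knowledge.

```python
from datetime import datetime


def to_iso_date(text, source_format):
    """Parse text using the known source format and return YYYY-MM-DD."""
    return datetime.strptime(text.strip(), source_format).date().isoformat()


# Hypothetical records from two systems that use different date conventions.
uk_system = ["31/12/2020", "01/02/2021"]   # day/month/year
us_system = ["12/31/2020", "02/01/2021"]   # month/day/year

uniform_dates = (
    [to_iso_date(d, "%d/%m/%Y") for d in uk_system]
    + [to_iso_date(d, "%m/%d/%Y") for d in us_system]
)
print(uniform_dates)
```

A value that does not match the stated format raises `ValueError`, which is a useful signal that the record needs further attention.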
Data cleansing techniques
Data cleansing article: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
By understanding your data and framing the business problem you can identify fields and rows that cannot contribute to the final answer. So if you are searching for ideal colour combinations in women's winter wear you can probably safely drop all records for men.
Repeated data can appear when combining datasets from different systems or from human input. The record or fields may not be completely identical. You will need to compare records on important fields or with a fuzzy match.
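One way to find near-duplicate records is to score the similarity of an important field and flag pairs above a threshold. The sketch below uses `difflib.SequenceMatcher` from the standard library; the customer records, the `name` field and the 0.85 threshold are assumptions for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case and padding."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_probable_duplicates(records, key, threshold=0.85):
    """Return index pairs of records whose key field is a fuzzy match."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i][key], records[j][key]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical customer records combined from two systems.
customers = [
    {"name": "Jon Smith"},
    {"name": "John Smith "},
    {"name": "Mary Jones"},
]

print(find_probable_duplicates(customers, "name"))
```

Comparing every pair is quadratic in the number of records, so on large datasets the comparison is normally restricted to records that already share an exact value in another field.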
Many data types may appear as text. Often this is not as useful as if they were in their original format. For example, a number in a numeric format can be statistically analysed, but this cannot be done if it is a string.
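A minimal sketch of type conversion follows. It assumes that commas are thousands separators and that anything that cannot be parsed should become `None` rather than stop processing; both are assumptions that depend on your data source.

```python
import statistics


def to_number(text):
    """Convert text to a float, or return None if it is not a number."""
    cleaned = text.strip().replace(",", "")  # assumes "," is a thousands separator
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical values read from a text file.
raw_values = ["10", "20", " 30 ", "1,000", "unknown"]

numbers = [to_number(v) for v in raw_values]
valid = [n for n in numbers if n is not None]

print(numbers)
print(statistics.mean(valid))
```

Once the values are numeric, statistics such as the mean can be calculated, which is not possible while they are strings.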
Syntax errors often appear in manually entered data sources where there can be some scope to enter correct data in a variety of ways. Validation in the user interface should minimise this.
Ensure all fields have consistent units. This is particularly important with internationally derived data where different measuring systems are used such as American or Metric weights and measures.
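For example, weights recorded in different systems can be converted to a single unit before processing. This is a sketch that assumes each value is stored alongside a unit label; the label names in the lookup table are hypothetical, while the conversion factors are the standard definitions of the pound and the ounce.

```python
# Conversion factors to kilograms (1 lb is defined as 0.45359237 kg).
KILOGRAMS_PER_UNIT = {
    "kg": 1.0,
    "g": 0.001,
    "lb": 0.45359237,
    "oz": 0.028349523125,
}


def to_kilograms(value, unit):
    """Convert a weight to kilograms, rejecting unit labels we do not know."""
    try:
        factor = KILOGRAMS_PER_UNIT[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown unit: {unit!r}") from None
    return value * factor


# Hypothetical weights from an American and a metric source.
weights = [(10, "lb"), (2, "kg"), (500, "g")]
print([round(to_kilograms(value, unit), 3) for value, unit in weights])
```

Rejecting unknown units, rather than guessing, stops a mislabelled value from silently distorting the dataset.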
Missing values and outliers can distort results and therefore affect how the Machine Learning model makes predictions. By understanding your data you can find out the reason behind the data’s absence. The empty field could be an optional field from customer data input, or it could be a mutually exclusive field paired with another where only one contains data. A field with missing data could be null or have an indicator to show the data is absent, for example:
- NaN (Not a Number)
- NA (Not applicable)
Missing data could be ignored, or the whole record could be dropped. However this could distort the results. The alternative is to replace the data. There are four methods that can be employed to replace missing data:
- Mean – use the average value to replace missing values.
- Median – use the middle value of all values to replace missing values.
- Mode – use the most common value to replace missing values.
- Use a Machine Learning algorithm to predict the missing values depending on other values in the data.
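The first three replacement methods can be sketched with the standard library `statistics` module. In this sketch missing values are assumed to be represented as `None` or NaN, and the `impute` function name and its `strategy` parameter are hypothetical.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute(values, strategy="mean"):
    """Replace missing values using the mean, median or mode of the rest."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a field with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return [fill if is_missing(v) else v for v in values]


# Hypothetical field with two missing entries and one outlier (10).
field = [1, 2, 2, None, 10, float("nan")]

print(impute(field, "mean"))
print(impute(field, "median"))
print(impute(field, "mode"))
```

Note how the outlier pulls the mean above the median and the mode. This is why the median is often preferred for skewed numeric data, and the mode for categorical data.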
Inferential statistics can make predictions about data based on a sample of training data. This is what a Machine Learning algorithm does. So by using trained Machine Learning models, data can be cleaned and prepared for later processing by the chosen Machine Learning model.
- Inferential statistics (https://en.wikipedia.org/wiki/Statistical_inference)
Cleansing streaming data
Kinesis Data Analytics
Streaming ETL applications allow you to clean the streaming data as it is collected. Kinesis Data Analytics has an SQL interface that enables you to build a complete cleansing app so that all data you pass on is clean data. The advantage of cleaning data at the point of entry is that you do not have to develop large, powerful cleansing apps to trawl through the landed data. Not only do you save on this processing, the landed data is ready sooner for further preparation or ingestion into the Machine Learning model.
Kinesis Data Firehose
Kinesis Data Firehose can process streaming data with Lambda functions to clean the data.
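A Firehose transformation Lambda receives a batch of base64-encoded records and must return each `recordId` with a `result` of `Ok`, `Dropped` or `ProcessingFailed`. The sketch below follows that contract; the cleaning rules themselves (JSON payloads, a mandatory `customer_id` field, trimming whitespace) are hypothetical and only illustrate where your own rules would go.

```python
import base64
import json


def lambda_handler(event, context):
    """Clean each record in a Kinesis Data Firehose transformation batch."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Unparseable input: hand the original data back marked as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        # Hypothetical rule: records without a customer_id are not usable.
        if payload.get("customer_id") in (None, ""):
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue

        # Hypothetical rule: trim stray whitespace from every text field.
        cleaned = {key: value.strip() if isinstance(value, str) else value
                   for key, value in payload.items()}
        encoded = base64.b64encode(
            (json.dumps(cleaned) + "\n").encode("utf-8")).decode("utf-8")
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": encoded})
    return {"records": output}
```

The handler can be tried locally by calling it with a dictionary shaped like a Firehose event, without deploying anything.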
Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth is a service you can use to manually label data. It provides a configurable workflow process to manage the manual labeling of data. The workforce that does the manual labeling can be:
- A private workforce that you provide
- Third party vendor
- Amazon Mechanical Turk
Amazon Mechanical Turk
Whilst we aim to automate everything, there are still some tasks that are best done by humans. MTurk is an API and a service from Amazon that acts as an interface between the Requester and people willing to engage in Human Intelligence Tasks (HITs) for payment.
This work is often for repetitive manual tasks that occur infrequently, which would otherwise require a business to take on temporary workers or try to outsource the tasks. With MTurk this is no longer necessary, as the API allows a business to access a worldwide labour force. MTurk also allows a business to engage directly with workers, cutting out the middle man to get the best price. Because this is an on-demand service you only pay for what you use and avoid the responsibilities of being an employer.
Video: Build Highly Accurate Training Datasets Using Amazon SageMaker Ground Truth
This AWS video by Kate Werling is 38 minutes long, with the first 13 minutes spent introducing Ground Truth followed by two demos. Here are the timings, in minutes, so you can select the parts most relevant to your studies:
- 0 – How can we build Machine Learning Models faster
- 4.45 – What kind of training data do I need?
- 7.10 – What is supervised learning?
- 9.19 – Amazon SageMaker Ground Truth, how it works.
- 10.18 – Amazon Mechanical Turk mentioned
- 13.22 – Demo – Mechanical Turk labeling images
- 21.28 – Demo – Aerial photography
- 33.10 – AWS Marketplace
- 34.20 – Amazon SageMaker Endpoints
- 38.39 – End
Data augmentation techniques are used to increase the amount of data available for training a Machine Learning model. Original data is changed slightly, or data is synthesised using existing data as a model. The new data will allow a model to be trained to include features that the available training data does not contain, but could do.
You could train a Machine Learning model to recognise a rare bird for a nature conservancy project. You may only have a few images of the bird, which would not allow the Machine Learning algorithm to be effectively trained. You could create more images using image editing software by changing the size or orientation of the bird or synthesising the effects of different weather and lighting. A larger volume of training data will give the Machine Learning algorithm a greater chance of identifying the bird in different circumstances.
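The idea of creating extra training images by changing orientation and lighting can be sketched without any image library. Here an image is assumed to be a grid of greyscale pixel values from 0 to 255; the tiny 2x2 image and the function names are hypothetical, and a real project would normally use an image processing library to do the same thing.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, factor):
    """Scale every pixel, keeping values within the 0 to 255 range."""
    return [[min(255, max(0, round(pixel * factor))) for pixel in row]
            for row in image]


def augment(image):
    """Return several altered copies of one original image."""
    return [
        flip_horizontal(image),
        rotate_90_clockwise(image),
        adjust_brightness(image, 0.5),   # darker, e.g. overcast weather
        adjust_brightness(image, 1.5),   # brighter, e.g. direct sunlight
    ]


# Hypothetical 2x2 greyscale image.
original = [
    [0, 50],
    [100, 200],
]

for variant in augment(original):
    print(variant)
```

Each original image produces four additional training examples here, and the label (the bird species) stays the same for all of them.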
Descriptive statistics are used to allow you to understand your data. This is the important first step for data cleansing and preparation. Once the data is understood, data cleansing techniques can be used to improve the data quality. The quantity and richness of the data can be improved using Amazon Mechanical Turk, Data Augmentation and SageMaker Ground Truth.
- This Study Guide is for the sub-domain 2.1 Sanitize and prepare data for modeling. Data sanitization usually refers to procedures to securely destroy data and media, so I have used the term data cleansing since this is more relevant to Machine Learning.
- The AWS Exam Readiness course mentioned Kinesis Video Streams as Other topics: Data Generation. I have not identified how Kinesis Video Streams can be used for this purpose.
Image by Ri Butov from Pixabay