
Data transformation for Machine Learning

This Study Guide is about transforming raw data so it is ready for Machine Learning. There are two types of transformation:

  1. Changing data structure
  2. Cleaning data

“Identify and implement a data-transformation solution” is sub-domain 1.3 of the Data Engineering knowledge domain. For more information about the exam structure, see the AWS Machine Learning exam syllabus.

Questions

To confirm your understanding, scroll to the bottom of the page for 10 questions and answers.

Changing structure

The raw data you receive may be in a structure, or even a format, that a Machine Learning algorithm cannot use directly. Amazon SageMaker implements a wide range of algorithms, and each one has specific requirements for how its input data should be structured. Data transformation techniques change the raw data into a structure that can be ingested by SageMaker or another AWS Machine Learning service.
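
As a minimal sketch of a structural change, the snippet below reshapes JSON Lines records into header-less CSV with the target value in the first column, a layout that several SageMaker built-in algorithms (XGBoost, for example) expect for CSV training data. The field names (churned, age, plan_minutes) and the helper name are hypothetical, chosen only for illustration.

```python
import csv
import io
import json


def json_lines_to_training_csv(json_lines, label_field, feature_fields):
    """Convert JSON Lines text to header-less CSV with the label column first."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Label first, then the features in a fixed, explicit order.
        row = [record[label_field]] + [record[f] for f in feature_fields]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical raw records: the label is the last key in each JSON object.
raw = (
    '{"age": 34, "plan_minutes": 500, "churned": 0}\n'
    '{"age": 51, "plan_minutes": 120, "churned": 1}\n'
)
training_csv = json_lines_to_training_csv(raw, "churned", ["age", "plan_minutes"])
print(training_csv)
```

Listing the feature fields explicitly keeps the column order stable even when the key order differs between source records.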

Cleaning data

To get good generalisations out of a Machine Learning model you have to have clean data going in. Dirty data can distort results and hide important conclusions. Common problems that need cleaning are:

  1. Inconsistent Schema – the names and order of fields vary between files of the same data
  2. Extraneous Text – additional unnecessary information in a field
  3. Missing Data – a field that should contain data is empty or contains a “data not available” indicator
  4. Redundant Information – the same data is available in more than one place in a record
  5. Contextual Errors – the data may be a valid value, but in the real-world context it is wrong
  6. Junk Values – meaningless data in a field
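
The six problems above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a general-purpose cleaner: every field is assumed to be numeric, and the alias table, missing-value markers, redundant field and age range are all hypothetical.

```python
import re

# Hypothetical aliases: different files name the same field differently.
ALIASES = {"Age": "age", "age_years": "age", "Weight": "weight_kg", "weight": "weight_kg"}
MISSING_MARKERS = {"", "n/a", "na", "null", "-"}
REDUNDANT_FIELDS = {"age_in_months"}  # derivable from age, so dropped
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def clean_record(raw):
    """Return a cleaned copy of one record (a dict of field name to string)."""
    cleaned = {}
    for name, value in raw.items():
        name = ALIASES.get(name, name)           # 1. inconsistent schema
        if name in REDUNDANT_FIELDS:             # 4. redundant information
            continue
        text = str(value).strip()
        if text.lower() in MISSING_MARKERS:      # 3. missing data
            cleaned[name] = None
            continue
        match = NUMBER.search(text)              # 2. extraneous text, e.g. "72 kg"
        # 6. junk values: anything without a number becomes None
        cleaned[name] = float(match.group()) if match else None
    age = cleaned.get("age")
    if age is not None and not 0 <= age <= 120:  # 5. contextual error
        cleaned["age"] = None
    return cleaned


example = clean_record(
    {"Age": "230", "Weight": "72 kg", "age_in_months": "2760", "height_cm": "N/A"}
)
print(example)
```

In this sketch problem values are replaced with None rather than dropped, so a later step can decide whether to impute them or discard the record.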

Data transformation services

If the raw data is large you will need to use a service that has the compute power and architecture to get the work done quickly.

Apache Spark on EMR

Several systems can run on Amazon EMR; one of them is Apache Spark. EMR acts as a performance-optimised runtime environment for Apache Spark.

By using a directed acyclic graph (DAG) execution engine, Spark can create efficient query plans for data transformations. Spark also stores input, output, and intermediate data in-memory as resilient dataframes, which allows for fast processing without I/O cost, boosting performance of iterative or interactive workloads.

AWS website: Apache Spark on Amazon EMR

EMR can process static or real-time streaming data.

Java, Scala, SQL, and Python are supported, with Python and Jupyter Notebooks being a popular choice for Machine Learning.

Apache Spark on EMR is discussed further in The Machine Learning Production Environment

Apache Spark with Amazon SageMaker

Apache Spark can also be run with SageMaker. In this configuration Spark is used for pre-processing the data, that is, data transformation and data cleansing, and SageMaker is used for model training and hosting.

Amazon Athena

Amazon Athena allows you to use standard SQL on datasets in S3. For Athena to understand the structure of the S3 data, a Data Catalogue must be available. In the past Athena had its own Data Catalogue; now Glue provides the Data Catalogue.

Amazon Athena uses Presto, a distributed query engine, to provide fast access to data in these formats:

  • CSV
  • JSON
  • ORC
  • Avro
  • Parquet
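
As a hedged illustration of using Athena SQL for transformation, the sketch below builds a CREATE TABLE AS SELECT (CTAS) statement that rewrites a catalogued table as Parquet, and assembles the arguments for the Athena StartQueryExecution call. The table, database and bucket names are hypothetical. The function that submits the query needs boto3 and AWS credentials, so it is defined but not called here.

```python
def build_ctas_query(source_table, target_table, output_location):
    """Build an Athena CTAS statement that rewrites a table as Parquet."""
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', external_location = '{output_location}') "
        f"AS SELECT * FROM {source_table}"
    )


def build_query_request(sql, database, results_location):
    """Assemble keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_location},
    }


def run_query(request):
    """Submit the query. Requires boto3 and AWS credentials; not called here."""
    import boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names, for illustration only.
sql = build_ctas_query("raw_sales", "sales_parquet", "s3://example-processed-bucket/sales/")
request = build_query_request(sql, "example_db", "s3://example-athena-results/")
print(request["QueryString"])
```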

Federated queries and SageMaker integration are in preview and so will not be part of the exam.

AWS Glue

AWS Glue is Amazon’s ETL (Extract, Transform, Load) service. A service flexible enough to handle almost any input data, perform almost any transformation, output data in almost any format and do all of this quickly on massive workloads is necessarily complex and has many components. To make it easier to understand, we will follow an example process and describe what Glue does at each stage and how it does it.

Workflow

A Glue Workflow is a new Glue feature that can orchestrate the activities of Glue Jobs, Crawlers and Triggers. In a typical workflow, or pipeline, data is transferred from a raw S3 bucket to a processed S3 bucket and is transformed along the way. The processed output data is ready for Machine Learning processing.

Extract

The incoming data arrives in the landing bucket. In this example the landing event does not start the Glue workflow; the workflow starts on a schedule that has been configured in the Glue Trigger named Landing Trigger. At a specific time, in this case 6 am, the Glue Trigger causes the Glue Crawler named Landing Crawler to begin processing the files in the landing S3 bucket. This populates the Glue Database.

Glue Trigger

A Glue Trigger is a configurable Glue item that starts Glue Crawlers and Jobs. A Glue Trigger can be configured to fire on a schedule, or when it detects another event, for example when a Crawler or Job has completed processing. In this example the Landing Trigger is scheduled to start at a specific time.
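
A scheduled trigger like the Landing Trigger might be defined through the Glue CreateTrigger API roughly as follows. This is a sketch: the hyphenated resource names are hypothetical stand-ins for the names used in the example, and the cron expression assumes that 6 am means 06:00 UTC every day. The call to AWS needs boto3 and credentials, so it is defined but not executed.

```python
def scheduled_crawler_trigger(name, crawler_name, cron_expression):
    """Keyword arguments for Glue's create_trigger: start a crawler on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"CrawlerName": crawler_name}],
        "StartOnCreation": True,  # activate the trigger as soon as it is created
    }


def create_trigger(request):
    """Create the trigger. Requires boto3 and AWS credentials; not called here."""
    import boto3

    return boto3.client("glue").create_trigger(**request)


# Glue cron fields: minutes, hours, day-of-month, month, day-of-week, year.
landing_trigger = scheduled_crawler_trigger(
    "landing-trigger", "landing-crawler", "cron(0 6 * * ? *)"
)
print(landing_trigger)
```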

Glue Crawler

A Glue Crawler is a configurable Glue item that searches a data source, either a database or, as in this example, an S3 bucket, and populates a Glue Database with what it finds. This is the first stage of data transformation: until Glue understands what the source data looks like, no further processing can be done. To specify which files in the S3 bucket, or which tables in a database, will be crawled, exclusion parameters define what should be ignored. To recognise the format of the data the Crawler uses a prioritised list of Data Classifiers. Once the data format is identified, the Crawler analyses the structure of the data and creates tables in a Glue Database.
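
The crawler settings described above map onto the Glue CreateCrawler API roughly as in this sketch. The crawler name, IAM role ARN, database name, bucket path and exclusion patterns are all hypothetical. Leaving the classifier list empty means only Glue's built-in classifiers are used; any names listed there would be custom classifiers, which take priority.

```python
def s3_crawler_request(name, role_arn, database, s3_path, exclusions, classifiers=()):
    """Keyword arguments for Glue's create_crawler, targeting one S3 path."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Glue Database that receives the tables
        "Targets": {
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]
        },
        "Classifiers": list(classifiers),  # custom classifiers, tried first
    }


def create_crawler(request):
    """Create the crawler. Requires boto3 and AWS credentials; not called here."""
    import boto3

    return boto3.client("glue").create_crawler(**request)


# Hypothetical values, for illustration only.
landing_crawler = s3_crawler_request(
    name="landing-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    database="landing_db",
    s3_path="s3://example-landing-bucket/incoming/",
    exclusions=["**.tmp", "archive/**"],  # glob patterns for objects to ignore
)
print(landing_crawler["Targets"])
```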

Glue Database

The Glue Database consists of tables that have been created by a Glue Crawler. These tables describe the data structure and are used to retrieve the data and manipulate it during data transformation. The Glue Database sits in the Glue Data Catalogue. There is one Data Catalogue per account per region, and it holds all the Glue Databases that have been created by Glue Crawlers. Athena used to have its own Data Catalogue, but this is being phased out. Going forwards this will be the only Data Catalogue used by all AWS services.
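
To make the idea of tables that describe the data structure concrete, the sketch below extracts column names and types from a response shaped like the one returned by the Glue GetTables API. The sample response is hand-written and hypothetical, and the helper reads only the first page of results (real responses may be paginated with NextToken).

```python
def table_schemas(get_tables_response):
    """Map each table name to its list of (column name, column type) pairs."""
    schemas = {}
    for table in get_tables_response["TableList"]:
        columns = table["StorageDescriptor"]["Columns"]
        schemas[table["Name"]] = [(c["Name"], c["Type"]) for c in columns]
    return schemas


def fetch_schemas(database):
    """Read schemas from the Data Catalogue. Requires boto3 and AWS credentials."""
    import boto3

    glue = boto3.client("glue")
    return table_schemas(glue.get_tables(DatabaseName=database))


# Hand-written sample in the shape of a get_tables response (hypothetical data).
sample_response = {
    "TableList": [
        {
            "Name": "landing_sales",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "order_id", "Type": "bigint"},
                    {"Name": "price", "Type": "double"},
                ]
            },
        }
    ]
}
schemas = table_schemas(sample_response)
print(schemas)
```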

Video – Getting Started with AWS Glue Data Catalog, 5.20 minutes

Transform

The Glue Job is the Glue item that processes the data, performing data transformation and data cleansing. Under the hood Glue starts an Apache Spark cluster, which is where the processing power comes from. The cluster is abstracted away as part of the Glue service: the only parameters you can change are the worker type and the maximum number of workers Glue can create. Programs executed by Glue may be written in Python, using PySpark for Spark jobs; Scala is also supported. Python is a popular choice for Machine Learning.
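
The two tunable settings mentioned above, worker type and number of workers, appear as WorkerType and NumberOfWorkers in the Glue CreateJob API. The sketch below assembles such a request for a PySpark job. The job name, role ARN and script location are hypothetical, and the Glue version and worker values are assumptions chosen for illustration, not recommendations.

```python
def spark_job_request(name, role_arn, script_location, worker_type="G.1X", workers=10):
    """Keyword arguments for Glue's create_job, describing a PySpark ETL job."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",        # assumed version, for illustration
        "WorkerType": worker_type,   # size of each worker
        "NumberOfWorkers": workers,  # how many workers the job can use
    }


def create_job(request):
    """Create the job. Requires boto3 and AWS credentials; not called here."""
    import boto3

    return boto3.client("glue").create_job(**request)


# Hypothetical values, for illustration only.
transform_job = spark_job_request(
    name="transform-job",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    script_location="s3://example-scripts-bucket/transform.py",
    workers=5,
)
print(transform_job["WorkerType"], transform_job["NumberOfWorkers"])
```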

Load

In this example the data is output to an S3 bucket ready to be used by other services in the Machine Learning pipeline. Some services can ingest the files directly, but others can only access the data if its structure is in the Data Catalogue. So once the Glue Job has completed, a Glue Trigger fires and starts a Glue Crawler to analyse the files in the processed S3 bucket. The Crawler populates tables in a new Glue Database. When it has finished, the data can be used by services that use a Data Catalogue, such as Athena and Redshift Spectrum.
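
The step where a completed Glue Job starts the next Crawler can be expressed as a conditional trigger in the Glue CreateTrigger API. A minimal sketch, with hypothetical trigger, job and crawler names, and with the AWS call defined but not executed because it needs boto3 and credentials:

```python
def crawl_after_job_trigger(name, job_name, crawler_name):
    """Keyword arguments for create_trigger: run a crawler when a job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": job_name,
                    "State": "SUCCEEDED",  # fire only after a successful run
                }
            ]
        },
        "Actions": [{"CrawlerName": crawler_name}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Create the trigger. Requires boto3 and AWS credentials; not called here."""
    import boto3

    return boto3.client("glue").create_trigger(**request)


# Hypothetical names, for illustration only.
processed_trigger = crawl_after_job_trigger(
    "processed-trigger", "transform-job", "processed-crawler"
)
print(processed_trigger["Predicate"])
```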

Video – Getting started with AWS Glue ETL, 6.23 minutes

Amazon Redshift Spectrum

Amazon Redshift Spectrum can be used for ad-hoc ETL processing, which includes data transformation and cleansing. S3 is used as the data repository; however, before the data can be accessed its structure must be defined in a Data Catalogue. Glue Crawlers are used to discover the data’s format and structure and populate tables in the Data Catalogue. Redshift Spectrum can then use the catalogue to access the raw data files in S3, using standard SQL queries to clean the data and transform its structure.
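
A hedged sketch of that sequence: first an external schema is created in Redshift that points at a Glue Data Catalogue database, then an ordinary SQL query cleans the S3 data through that schema. The schema, database, table, column and role names are hypothetical. The statements could be submitted with the Redshift Data API, which needs boto3 and AWS credentials, so that function is defined but not called.

```python
def external_schema_sql(schema, catalog_database, iam_role_arn):
    """SQL that exposes a Glue Data Catalogue database to Redshift Spectrum."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_database}' "
        f"IAM_ROLE '{iam_role_arn}'"
    )


def cleaning_query(schema, table):
    """SQL that trims stray whitespace and skips rows with a missing price."""
    return (
        f"SELECT TRIM(product_name) AS product_name, price "
        f"FROM {schema}.{table} WHERE price IS NOT NULL"
    )


def run_statement(sql, cluster_id, database, db_user):
    """Submit SQL via the Redshift Data API. Requires boto3; not called here."""
    import boto3

    client = boto3.client("redshift-data")
    return client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )


# Hypothetical names, for illustration only.
schema_sql = external_schema_sql(
    "spectrum", "landing_db", "arn:aws:iam::123456789012:role/example-spectrum-role"
)
query_sql = cleaning_query("spectrum", "landing_sales")
print(schema_sql)
print(query_sql)
```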

Summary

Data transformation and cleaning is a process that takes raw source data and changes it into a state that is ready for further processing in the Machine Learning pipeline. Data transformation changes the structure of the data and data cleansing corrects problems with the data that would adversely affect the results of Machine Learning processing. The AWS services used for this activity are Athena, Glue, Redshift Spectrum, Apache Spark on EMR and Spark on SageMaker.





10 questions and answers

By Michael Stainsbury

1.3 Data transformation for Machine Learning (full)

This test has 10 questions. It tests knowledge of the study guide for sub-domain 1.3, Identify and implement a data-transformation solution, of the Data Engineering knowledge domain.

1 / 10

What can Athena do?

2 / 10

Which of these will require data cleaning techniques to be used?

3 / 10

What processing must be performed before Amazon Redshift Spectrum can be used for ad-hoc ETL processing?

4 / 10

How is metadata loaded into the Data Catalogue?

5 / 10

6 / 10

Which AWS services are available to prepare data for Machine Learning?

7 / 10

Dirty data is a problem because …

8 / 10

9 / 10

What aspect of data preparation could EMR be used for?


