Data transformation for Machine Learning
This Study Guide is about transforming raw data so it is ready for Machine Learning. There are two types of transformation:
- Changing data structure
- Cleaning data
"Identify and implement a data transformation solution" is sub-domain 1.3 of the Data Engineering knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus
Questions
To confirm your understanding, scroll to the bottom of the page for 10 questions and answers.
Changing structure
The raw data you receive may be in a structure or format that cannot be used directly by a Machine Learning algorithm. Amazon SageMaker implements a wide range of algorithms, and each one has specific requirements for how its input data should be structured. Data transformation techniques change the raw data into a structure that can be ingested by SageMaker or another AWS Machine Learning service.
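As a concrete illustration, many SageMaker built-in algorithms that train on CSV data expect the label in the first column and no header row. The sketch below is a minimal example of that kind of restructuring with pandas; the file names and the assumption that the raw file has a header with the label in the last column are hypothetical.

```python
import pandas as pd

# Hypothetical raw file: header row present, label in the last column.
df = pd.read_csv("raw.csv")

# Move the label to the first column and drop the header on output,
# the layout SageMaker built-in algorithms expect for CSV training data.
label = df.columns[-1]
df = df[[label] + [c for c in df.columns if c != label]]
df.to_csv("train.csv", header=False, index=False)
```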
Cleaning data
To get good generalisations out of a Machine Learning model you need clean data going in. Dirty data can distort results and hide important conclusions. Common problems include the following (a short pandas sketch of some of the fixes appears after the list):
- Inconsistent Schema – the names and order of fields varies between files of the same data.
- Extraneous Text – additional unnecessary information in a field
- Missing Data – a field that should contain data is empty or contains a data not available indicator
- Redundant Information – the same data is available in more than one place in a record
- Contextual Errors – the data may be a correct value, but in the real world context it is wrong
- Junk Values – meaningless data in a field
- More on Cleaning Data: Data cleansing and preparation for modeling
- Further reading: Clean data is the foundation of effective machine learning
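A minimal pandas sketch of a few of these fixes, assuming a hypothetical source file with columns such as label and age:

```python
import pandas as pd
import numpy as np

df = pd.read_csv("raw.csv")  # hypothetical source file

# Junk values: treat placeholder strings as missing data.
df = df.replace(["N/A", "n/a", "-", "?"], np.nan)

# Missing data: drop rows with no label, impute a numeric feature.
df = df.dropna(subset=["label"])
df["age"] = df["age"].fillna(df["age"].median())

# Contextual errors: an age of 200 is a valid number but wrong in context.
df = df[df["age"].between(0, 120)]

# Redundant information: remove duplicate records.
df = df.drop_duplicates()
```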

Data transformation services
If the raw data is large you will need to use a service that has the compute power and architecture to get the work done quickly.
Apache Spark on EMR

There are several frameworks that can run on Amazon EMR; one is Apache Spark. EMR acts as a performance-optimised runtime environment for Apache Spark.
By using a directed acyclic graph (DAG) execution engine, Spark can create efficient query plans for data transformations. Spark also holds input, output, and intermediate data in memory as resilient distributed datasets (RDDs), which allows fast processing without the I/O cost of writing to disk, boosting the performance of iterative or interactive workloads.
AWS website: Apache Spark on Amazon EMR
EMR can process static or real-time streaming data.
- AWS docs: Apache Spark on Amazon EMR
- AWS FAQs: Amazon EMR FAQs
Java, Scala, SQL, and Python are supported, with Python and Jupyter Notebooks being a popular choice for Machine Learning.
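As an illustration, here is a minimal PySpark transformation of the kind you might submit to an EMR cluster; the bucket names and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-raw-data").getOrCreate()

# Read raw CSV files from S3 (hypothetical bucket and columns).
raw = spark.read.csv("s3://raw-bucket/events/", header=True, inferSchema=True)

# Transform: parse the timestamp, cast a numeric field, drop unlabelled rows.
clean = (raw
         .withColumn("event_time", F.to_timestamp("event_time"))
         .withColumn("amount", F.col("amount").cast("double"))
         .dropna(subset=["label"]))

# Write the result back to S3 as Parquet, a column-oriented format.
clean.write.mode("overwrite").parquet("s3://processed-bucket/events/")
```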
Apache Spark on EMR is discussed further in The Machine Learning Production Environment
Apache Spark with Amazon SageMaker

Apache Spark can also be run with SageMaker. In this configuration Spark is used for pre-processing data and SageMaker for model training and hosting. The pre-processing would be data transformation and data cleansing activities.
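One common pattern is to write the Spark pre-processing output to S3 and then point a SageMaker training job at it using the SageMaker Python SDK. The sketch below assumes that pattern; the IAM role, bucket names and choice of the Linear Learner algorithm are hypothetical.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Built-in Linear Learner algorithm, chosen here only as an example.
image_uri = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://processed-bucket/model-output/",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="binary_classifier")

# Train on the CSV files written by the Spark pre-processing step.
estimator.fit({"train": TrainingInput("s3://processed-bucket/train/",
                                      content_type="text/csv")})
```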
Amazon Athena

Amazon Athena allows you to use standard SQL on datasets in S3. For Athena to understand the structure of the S3 data a Data Catalogue must be available. In the past Athena had its own Data Catalogue; now AWS Glue provides the Data Catalogue.
- Amazon Athena – Serverless Interactive Query Service
- Amazon Athena FAQs – Serverless Interactive Query Service
Amazon Athena uses Presto, a distributed query engine, to provide fast access to data in these formats (a boto3 sketch of submitting a query appears after the list):
- CSV
- JSON
- ORC
- Avro
- Parquet
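A sketch of running an Athena query from Python with boto3; the database name, query and results bucket are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit the query; results are written to the (hypothetical) output location.
query = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) AS n FROM events GROUP BY label",
    QueryExecutionContext={"Database": "processed_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```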
Federated queries and SageMaker integration are in preview and so will not be part of the exam.
AWS Glue

AWS Glue is Amazon’s ETL (Extract, Transform, Load) service. As you might expect, a service designed to be flexible enough to handle almost any input data, perform almost any transformation, and output data in almost any format, all very quickly and at massive scale, is complex and has many components. To make it easier to understand we will use an example process and describe what Glue does at each stage.
- AWS Glue – Managed ETL Service
- AWS Glue FAQs – Managed ETL Service
- Video: https://www.youtube.com/watch?v=Qpv7BzOM-UI
Workflow
A Glue Workflow is a newer Glue feature that orchestrates the activities of Glue Jobs, Crawlers and Triggers. In a typical workflow, or pipeline, data is transferred from a raw S3 bucket to a processed S3 bucket, undergoing transformation along the way. The processed output data is then ready for Machine Learning processing.
Extract
The incoming data arrives in the landing bucket. In this example the landing event does not start the Glue workflow; the workflow starts on a schedule configured in the Glue Trigger named Landing Trigger. At a specific time, in this case 6 am, the Glue Trigger causes the Glue Crawler named Landing Crawler to begin processing the files in the landing S3 bucket. This populates the Glue Database.
Glue Trigger
A Glue Trigger is a configurable Glue item that can make Glue Crawlers and Jobs start processing. Glue Triggers can be configured to start on a schedule, or in response to another event, for example a Crawler or Job completing. In this example the Landing Trigger is scheduled to start at a specific time.
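A boto3 sketch of a scheduled trigger like the Landing Trigger in this example; the trigger and crawler names and the 6 am cron expression are illustrative.

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger that starts the landing crawler at 06:00 UTC every day.
glue.create_trigger(
    Name="landing-trigger",                      # hypothetical name
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"CrawlerName": "landing-crawler"}],
    StartOnCreation=True,
)
```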
Glue Crawler
A Glue Crawler is a configurable Glue item that can search a data source, such as a database or, as in this example, an S3 bucket, and populate a Glue Database with what it finds. This is the first stage of data transformation: until Glue understands what the source data looks like no further processing can be done. To specify which files in the S3 bucket, or which tables in a database, will be crawled, exclusion parameters are used to define what should be ignored. To recognise the format of the data the Crawler uses a prioritised list of Data Classifiers. Once the data format is identified the Crawler analyses the structure of the data and creates tables in a Glue Database.
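A boto3 sketch of creating and starting a crawler over the landing bucket, with an exclusion pattern; the bucket, role and database names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Crawl the landing bucket, ignore temporary files, and write the discovered
# table definitions into the landing Glue Database.
glue.create_crawler(
    Name="landing-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="landing_db",
    Targets={"S3Targets": [{
        "Path": "s3://landing-bucket/",
        "Exclusions": ["**.tmp", "**_SUCCESS"],
    }]},
)
glue.start_crawler(Name="landing-crawler")
```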
Glue Database
The Glue Database comprises tables that have been created by a Glue Crawler. These tables describe the data structure and are used to retrieve the data and manipulate it during data transformation. The Glue Database sits in the Glue Data Catalogue; there is one Data Catalogue per region, per account, holding all the Glue Databases that have been created by Glue Crawlers. Athena used to have its own Data Catalogue, but this is being phased out; going forwards the Glue Data Catalogue will be the only Data Catalogue used by all AWS services.
Video – Getting started with AWS Glue Data Catalog
Transform
The Glue Job is the Glue item that processes the data and performs data transformation and data cleansing. Under the hood Glue fires up an Apache Spark cluster; this is where the processing power comes from. The cluster is abstracted away as part of the Glue service: the only parameters you can change are the instance type used by the workers and the maximum number of workers Glue can create. Programs executed by Glue may be written in PySpark or Python. Python is a popular choice for Machine Learning.
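A minimal Glue job script in PySpark, showing the typical read-from-catalogue, map, write-to-S3 shape; the database, table, field and bucket names are hypothetical.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from the Glue Database populated by the landing crawler.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="landing_db", table_name="raw_events")

# Rename and retype fields (hypothetical mapping).
mapped = ApplyMapping.apply(frame=dyf, mappings=[
    ("id", "string", "id", "string"),
    ("ts", "string", "event_time", "timestamp"),
    ("amount", "string", "amount", "double"),
])

# Write to the processed bucket as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://processed-bucket/events/"},
    format="parquet",
)
job.commit()
```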
Load
In this example the data is output to an S3 bucket ready to be used by other services in the Machine Learning pipeline. Some services can ingest the files directly, but others can only access them if their structure is in the Data Catalogue. So once the Glue Job has completed, a Glue Trigger fires and starts a Glue Crawler to analyse the files in the processed S3 bucket. The Crawler populates tables in a new Glue Database. When it has finished, the data can be used by services that use a Data Catalogue, such as Athena and Redshift Spectrum.
Video – Getting started with AWS Glue ETL
Amazon Redshift Spectrum

Amazon Redshift Spectrum can be used for ad-hoc ETL processing, which includes data transformation and cleansing. S3 is used as the data repository; however, before the data can be accessed its structure must be defined in a Data Catalogue. Glue Crawlers are used to discover the data's format and structure and populate tables in the Data Catalogue. Redshift Spectrum can then use the catalogue to access the raw data files in S3, using standard SQL queries to clean and transform the structure of the data.
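A sketch of querying the processed data from Redshift Spectrum: an external schema is mapped onto the Glue Data Catalogue database and then queried with standard SQL. The cluster, database, role and table names are hypothetical, and the statements are submitted here through the Redshift Data API.

```python
import boto3

redshift_data = boto3.client("redshift-data")

def run_sql(sql):
    # Submit a statement to a (hypothetical) Redshift cluster.
    return redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="analytics",
        DbUser="awsuser",
        Sql=sql,
    )

# Map an external schema onto the Glue Data Catalogue database
# that the processed-bucket crawler populated.
run_sql("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'processed_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
""")

# Query the raw files in S3 with standard SQL.
run_sql("SELECT label, COUNT(*) FROM spectrum.events GROUP BY label")
```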
- AWS docs: Getting started with Amazon Redshift Spectrum
- AWS blog: Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift
Summary
Data transformation and cleaning take raw source data and change it into a state that is ready for further processing in the Machine Learning pipeline. Data transformation changes the structure of the data, and data cleansing corrects problems with the data that would adversely affect the results of Machine Learning processing. The AWS services used for this activity are Athena, Glue, Redshift Spectrum, Apache Spark on EMR and Spark on SageMaker.
Credits
- Photograph: Photo by Quino Al on Unsplash
- AWS icons: Downloaded from https://aws.amazon.com/architecture/icons/
Infographic icon – cleaning data
- data cleaning by Annette Spithoven from the Noun Project
- missing data by LAFS from the Noun Project
- junk by Eucalyp from the Noun Project
- old man by monkik from the Noun Project
- Double Umbrella by Michael A. Salter from the Noun Project
- Be different by Victoruler from the Noun Project
- books by Olga from the Noun Project
AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam
This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions for immediate access. (Visit Amazon books)
10 questions and answers
