Data Engineering


The Data Engineering domain of the AWS Certified Machine Learning Specialty exam covers obtaining the data, transforming it and transferring it to a repository. This knowledge domain accounts for twenty percent of the exam marks and is divided into three subdomains.

The data repository (subdomain 1.1) is where the raw and processed data is stored. S3 is the repository of choice for Machine Learning in AWS, although some other data stores are also mentioned. The data ingestion subdomain (1.2) is concerned with getting the raw data into the repository, either by batch processing or by streaming. With batch processing, data is collected and grouped at a point in time and then passed to the data store. With streaming, data is collected continuously and fed into the data store as it arrives. The third subdomain (1.3) focuses on how raw data is transformed into data that can be used for ML processing. The transformation process changes the data structure. The data may also need to be cleaned and de-duplicated, incomplete records handled, and its attributes standardised.
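The transformation steps named above (de-duplication, handling incomplete data, standardising attributes) can be illustrated with a minimal sketch in plain Python. The record layout, the field names and the choice of mean imputation are assumptions made for illustration only; they are not taken from the exam guide or from any AWS service.

```python
from statistics import mean

# Hypothetical raw records; the field names are illustrative only.
raw_records = [
    {"id": 1, "city": " London ", "temp_c": 18.0},
    {"id": 1, "city": " London ", "temp_c": 18.0},  # duplicate of the first record
    {"id": 2, "city": "PARIS", "temp_c": None},      # incomplete: temperature missing
    {"id": 3, "city": "berlin", "temp_c": 22.0},
]


def clean(records):
    """De-duplicate on id, fill missing temperatures, standardise city names."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:  # keep the first occurrence of each id
            seen.add(record["id"])
            unique.append(dict(record))  # copy so the input is left untouched

    # Assumed strategy: replace a missing value with the mean of the known values.
    known = [r["temp_c"] for r in unique if r["temp_c"] is not None]
    fill_value = mean(known) if known else None

    for record in unique:
        if record["temp_c"] is None:
            record["temp_c"] = fill_value
        # Standardise the attribute: trim whitespace and use title case.
        record["city"] = record["city"].strip().title()
    return unique


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Mean imputation is only one way to manage incomplete data; dropping the record or using a median may suit a given dataset better.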

Once these data engineering processes are complete, the data is ready for further preprocessing before being fed into a Machine Learning algorithm. This preprocessing is covered by the second knowledge domain, Exploratory Data Analysis.

  1. For a description of the exam structure see this article:
  2. The AWS exam guide pdf can be downloaded from:

Contains affiliate links. If you go to the Whizlabs website and make a purchase I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.

Whizlabs AWS Certified Machine Learning Specialty

Practice Exams with 271 questions, Video Lectures and Hands-on Labs from Whizlabs

The Whizlabs AWS Certified Machine Learning Specialty practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined in the official documentation. The practice tests help candidates gain confidence during exam preparation and evaluate themselves against the exam content.

Practice test content

  • Free Practice test – 15 questions
  • Practice test 1 – 65 questions
  • Practice test 2 – 65 questions
  • Practice test 3 – 65 questions

Section test content

  • Core ML Concepts – 10 questions
  • Data Engineering – 11 questions
  • Exploratory Data Analysis – 13 questions
  • Modeling – 15 questions
  • Machine Learning Implementation and Operations – 12 questions

Sample Data Engineering questions

This test consists of five questions drawn at random from a pool of 17 questions covering the three subdomains.

Data engineering study guides

Data Engineering (domain 1)

Batch processing for Machine Learning

For Machine Learning, AWS Glue and AWS Database Migration Service are used to ingest data. Batch processing refers to processing that is usually performed on a set schedule. Before the batch process starts the data is waiting, and any new data often has to wait for the next batch run to be processed. In AWS any compute…
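The waiting behaviour described in this excerpt can be shown with a small sketch. The `BatchBuffer` class and its method names are hypothetical and only model the idea; this is not how AWS Glue or AWS Database Migration Service is implemented.

```python
class BatchBuffer:
    """Hypothetical buffer: records wait here until the next scheduled batch run."""

    def __init__(self):
        self._waiting = []

    def add(self, record):
        # New data is only stored; nothing is processed until a batch run.
        self._waiting.append(record)

    def run_batch(self, process):
        # Take everything that has accumulated so far and process it in one go.
        batch, self._waiting = self._waiting, []
        return [process(record) for record in batch]


buffer = BatchBuffer()
for reading in (3, 5, 8):
    buffer.add(reading)

first_run = buffer.run_batch(lambda value: value * 2)

buffer.add(13)  # arrives after the run, so it waits for the next batch
second_run = buffer.run_batch(lambda value: value * 2)

print(first_run, second_run)
```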


Identify data sources

Large, specialised datasets are needed to experiment with and train Machine Learning models so that they can recognise patterns in real-world data and infer a prediction. Datasets can also be used as a source of labeled data to train models that generalise to unlabeled real-world data. Fortunately there are many data sources for datasets…


Streaming data for Machine Learning

Streaming data processing is used when data is continuously being generated and needs to be processed as it arrives. The AWS service for streaming data processing is Kinesis. Kinesis comprises four services, each with different capabilities, and some of them can be used together. As well as Kinesis there is another AWS service that can…


Kinesis KPL vs API

The Kinesis Producer Library (KPL) and the Kinesis API can both be used to send data to Kinesis Data Streams. The advantage of the KPL is that it provides many added features out of the box, such as built-in handling of failed transmissions. If you use the Kinesis API you have to code these features yourself. The advantages…
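To give a feel for what "coding these features yourself" can involve, here is a minimal sketch of retrying failed records around the `PutRecords` API call. It relies on the documented response shape (`FailedRecordCount`, plus a `Records` list in request order where failed entries carry an `ErrorCode`). The function name `put_records_with_retry`, the backoff values and the `FlakyStubClient` stand-in are assumptions for illustration; the stub exists only so the example runs without an AWS account, and in real use the client would be something like `boto3.client("kinesis")`.

```python
import time


def put_records_with_retry(client, stream_name, records, max_attempts=3, base_delay=0.1):
    """Send records with PutRecords, re-sending only the entries that failed.

    Returns the records that still had not been accepted after max_attempts.
    """
    pending = list(records)
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []
        # Results come back in request order; failed entries contain an ErrorCode.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # simple exponential backoff
    return pending


class FlakyStubClient:
    """Hypothetical stand-in for a Kinesis client: rejects every other record once."""

    def __init__(self):
        self.calls = 0

    def put_records(self, StreamName, Records):
        self.calls += 1
        results = []
        for index, _ in enumerate(Records):
            if self.calls == 1 and index % 2 == 1:
                results.append({
                    "ErrorCode": "ProvisionedThroughputExceededException",
                    "ErrorMessage": "simulated throttling",
                })
            else:
                results.append({"SequenceNumber": str(index), "ShardId": "shardId-000000000000"})
        failed = sum("ErrorCode" in result for result in results)
        return {"FailedRecordCount": failed, "Records": results}


stub = FlakyStubClient()
records = [{"Data": f"event-{n}".encode(), "PartitionKey": str(n)} for n in range(4)]
unsent = put_records_with_retry(stub, "example-stream", records, base_delay=0.01)
print(unsent, stub.calls)
```

This covers only one of the features the KPL adds; it is a sketch of the retry idea, not a replacement for the library.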


MustangJoe on Pixabay