
Batch processing for Machine Learning

For Machine Learning, AWS Glue and AWS Database Migration Service (DMS) are commonly used to ingest data in batches. Glue performs ETL as a scheduled batch process, and DMS extracts data from databases and other non-S3 repositories so that it can be picked up by the batch process.

This topic, which is part of sub-domain 1.2, Identify and implement a data-ingestion solution, focuses on batch processing to ingest data. For ingestion of streaming data, see Streaming data for Machine Learning.

Questions

Scroll to the bottom of the page for questions and answers.

Batch processing using AWS Glue

Batch processing refers to processing that is usually performed to a specific schedule. Before the batch process starts, the data is waiting, and any new data that arrives afterwards usually has to wait for the next batch run to be processed. In AWS, any compute service can be used for batch processing. A common choice for Machine Learning is Glue, which performs ETL as a batch process. If the source data is in a database or another non-S3 data repository, AWS Database Migration Service can be used to extract the data so that the batch process can work on it.
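
To make the waiting time concrete, here is a minimal sketch that works out which batch run picks up a record. The daily 02:00 run time and the function name next_batch_run are assumptions made for illustration only; they are not AWS settings.

```python
from datetime import datetime, time, timedelta


def next_batch_run(arrival: datetime, run_at: time = time(2, 0)) -> datetime:
    """Return the first daily batch run that starts at or after `arrival`."""
    # Start with today's run; if it has already gone, the data waits for tomorrow's.
    candidate = datetime.combine(arrival.date(), run_at)
    if candidate < arrival:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    arrival = datetime(2024, 1, 10, 9, 30)  # hypothetical arrival time
    run = next_batch_run(arrival)
    print("processed at:", run, "after waiting", run - arrival)
```

A record that arrives just after a run starts waits almost a full day, which is the main trade-off of batch ingestion compared with streaming.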

Analysing the data structure

AWS Glue is an ETL service; ETL stands for Extract, Transform and Load. In this case we are concerned with using it as an extraction tool to ingest the data. The process starts with a Glue Crawler. The Crawler determines the data structure, or schema, of the data to be ingested. This information is used to create tables in a Glue database. The Glue database is part of the Glue Data Catalog, and there is one Data Catalog per account in each region. The Data Catalog is compatible with the Apache Hive metastore. To understand the data structure, Glue Crawlers use Data Classifiers. Each built-in Classifier is specific to a file type or data store, such as a JSON file or a database. There is a list of built-in Data Classifiers here: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html

If you have a data structure that is not listed, you can create a custom Classifier. During processing, the Crawler tries the custom Classifiers first and then the built-in Classifiers in order until one is found that can decode the data structure. The result is then used to create the tables in the Glue database.
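
As a rough sketch of how a Crawler might be defined with boto3, the helper below builds the arguments for the Glue create_crawler call. The helper names, the S3 target and every value passed in (crawler name, role ARN, database, bucket path, classifier name) are hypothetical; only create_crawler and start_crawler are actual boto3 Glue client methods.

```python
def build_crawler_request(name, role_arn, database, s3_path, custom_classifiers=()):
    """Build keyword arguments for boto3's glue.create_crawler (illustrative helper)."""
    request = {
        "Name": name,
        "Role": role_arn,            # IAM role the Crawler assumes
        "DatabaseName": database,    # Glue database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # assumes an S3 source
    }
    if custom_classifiers:
        # Names of custom Classifiers; these are tried before the built-in ones.
        request["Classifiers"] = list(custom_classifiers)
    return request


def create_and_start_crawler(request):
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported here so the builder above runs without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # All values below are placeholders.
    req = build_crawler_request(
        "source-crawler",
        "arn:aws:iam::123456789012:role/ExampleGlueRole",
        "source_db",
        "s3://example-source-bucket/input/",
        custom_classifiers=["my-custom-classifier"],
    )
    print(req)
```

Keeping the request builder separate from the AWS call means the arguments can be inspected without touching an AWS account.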

Moving the data

The Glue Job is a PySpark, Scala or plain Python program that can access the source data through the tables in the Glue database. For Spark jobs, Glue provisions a Spark cluster in the background to perform the processing for data ingestion. In this case the job moves the data to a raw data S3 bucket.
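
The choice between a Spark job and a plain Python job shows up in how the job is defined. The sketch below builds arguments for the boto3 Glue create_job call. The helper name, the Glue version and the worker settings are assumptions picked for illustration, and the script location is a placeholder.

```python
def build_job_request(name, role_arn, script_location, spark=True):
    """Build keyword arguments for boto3's glue.create_job (illustrative helper)."""
    request = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            # "glueetl" runs on a Spark cluster; "pythonshell" is plain Python.
            "Name": "glueetl" if spark else "pythonshell",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
    }
    if spark:
        # Assumed sizing for a small Spark job; adjust to the data volume.
        request.update({"GlueVersion": "4.0", "WorkerType": "G.1X", "NumberOfWorkers": 2})
    return request


if __name__ == "__main__":
    job = build_job_request(
        "ingest-to-raw",                                   # hypothetical job name
        "arn:aws:iam::123456789012:role/ExampleGlueRole",  # placeholder role
        "s3://example-scripts-bucket/ingest_to_raw.py",    # placeholder script
    )
    print(job)
    # With credentials configured: boto3.client("glue").create_job(**job)
```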

Making data available

Once the data has been moved, it is crawled again to load its structure into a Glue database, ready for further processing. Whilst Glue has only been discussed here as a data transfer tool, it can also be used for data transformation. This capability is covered in the Data transformation for Machine Learning study guide.

Scheduling the batch process

Glue Triggers are used to schedule the load process either by a time schedule or by detecting the completion of a previous Glue Job or Glue Crawler. Glue Triggers can also be orchestrated as part of a Glue Workflow.
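
The two kinds of Trigger described above can be sketched as arguments for the boto3 Glue create_trigger call. The helper names, trigger names, job name, crawler name and cron expression are hypothetical examples.

```python
def scheduled_trigger(name, cron, job_name):
    """Arguments for a Trigger that starts a Glue Job on a time schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # Glue cron syntax, e.g. "cron(0 2 * * ? *)"
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def after_crawler_trigger(name, crawler_name, job_name):
    """Arguments for a Trigger that starts a Glue Job once a Crawler succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    nightly = scheduled_trigger("nightly-ingest", "cron(0 2 * * ? *)", "ingest-to-raw")
    chained = after_crawler_trigger("after-crawl", "source-crawler", "ingest-to-raw")
    print(nightly)
    print(chained)
    # With credentials configured: boto3.client("glue").create_trigger(**nightly)
```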

AWS Database Migration Service for data ingestion

AWS Database Migration Service (DMS) is designed to transfer data between databases. There is a long list of supported data sources, including RDS, S3, and databases from IBM and SAP. Since it can also output data to S3, DMS can be used as a data ingestion tool.

The source database can be:

  • An RDS database
  • A database running on an EC2 instance
  • An on-premises database

The data is transferred using transactions, so it is reliable and you can be confident that all the data has been fully transferred. If there is a failure, any records in transit are rolled back.

Database Migration Service can be used for a one-off migration, or it can be configured to move data to a schedule or to replicate data continuously, where any changes in the source data are transferred as they are made.
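
In the DMS API, the choice between a one-off load and continuous replication is expressed through the MigrationType of a replication task. The sketch below builds arguments for the boto3 DMS create_replication_task call. The helper name, the mode labels and all ARNs are hypothetical, and the table mapping simply includes every table.

```python
import json

# Friendly labels (assumed for this sketch) mapped to DMS MigrationType values.
MIGRATION_TYPES = {
    "one_off": "full-load",
    "continuous": "cdc",
    "one_off_then_continuous": "full-load-and-cdc",
}


def build_replication_task(task_id, source_arn, target_arn, instance_arn,
                           mode="one_off", schema="%"):
    """Build keyword arguments for boto3's dms.create_replication_task."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,  # e.g. an S3 target endpoint
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": MIGRATION_TYPES[mode],
        "TableMappings": json.dumps(table_mappings),  # DMS expects a JSON string
    }


if __name__ == "__main__":
    task = build_replication_task(
        "db-to-s3",
        "arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",  # placeholder
        "arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET",  # placeholder
        "arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",     # placeholder
        mode="one_off_then_continuous",
    )
    print(task["MigrationType"])
    # With credentials configured:
    # boto3.client("dms").create_replication_task(**task)
```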

Summary

AWS Glue is a popular choice for ingesting data as a batch process. The Glue Crawler enables data in many different formats to be processed. Processing power is provided by a Spark cluster, and Python or Scala gives programming flexibility. AWS Database Migration Service is ideal for extracting data from a database to load into S3.

Credits

AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions for immediate access. (Visit Amazon books)


3 questions and answers

Created by Michael Stainsbury

1.2 Batch processing for Machine Learning (full)

This test is for the Batch Processing for Machine Learning study guide. Batch Processing is part of sub-domain 1.2, Identify and implement a data-ingestion solution, of the Data Engineering knowledge domain.

1 / 3

What does a Glue Crawler do?

2 / 3

What does DMS do?

3 / 3

What are the components of AWS Glue?

Glue components are:

  1. Glue Jobs
  2. Glue Crawlers
  3. Glue Databases
  4. Glue Tables
  5. Glue Triggers
  6. Glue Workflows


