Batch processing for Machine Learning
For Machine Learning, AWS Glue and AWS Database Migration Service (DMS) are used to ingest data. Batch processing is processing that usually runs to a set schedule. Until a batch run starts the data waits, and any data that arrives afterwards usually has to wait for the next run. In AWS any compute service can be used for batch processing. A common choice for Machine Learning is Glue, which performs ETL as a batch process. If the source data is in a database or another non-S3 data repository, AWS Database Migration Service can extract the data so that it can be processed in batches.
This topic, which is part of sub domain 1.2, Identify and implement a data-ingestion solution, focuses on batch processing to ingest data. For ingestion of streaming data see: Streaming data for Machine Learning
Batch processing using AWS Glue
Analysing the data structure
AWS Glue is an ETL service; ETL stands for Extract, Transform and Load. In this case we are concerned with using it as an extraction tool to ingest the data. The process starts with a Glue Crawler, which determines the structure, or schema, of the data to be ingested. This information is used to create tables in a Glue database. Glue databases are part of the Glue Data Catalog, and there is one Data Catalog per account in each region. The Data Catalog is compatible with the Apache Hive metastore. To understand the data structure, Glue Crawlers use Data Classifiers. Each built-in Classifier is specific to a file type or data store, such as a JSON file or a database. There is a list of built-in Data Classifiers here: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
If your data structure is not covered by a built-in Classifier you can create a custom Classifier. During processing the Crawler tries custom Classifiers first and then the built-in Classifiers in order until one is found that can decode the data structure. The resulting schema is then used to create the tables in the Glue database.
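As a rough illustration of the crawler step, the sketch below builds the request for the boto3 Glue `create_crawler` call and shows how it might be submitted. The crawler name, role ARN, database name, bucket path and classifier name are all made-up placeholders, and the source here is assumed to be files in S3.

```python
def build_crawler_request(name, role_arn, database, s3_path, custom_classifiers=()):
    """Build the keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,                # IAM role the crawler assumes
        "DatabaseName": database,        # Glue database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if custom_classifiers:
        # Names of custom classifiers that already exist in the account.
        request["Classifiers"] = list(custom_classifiers)
    return request


def create_and_start_crawler(request):
    """Submit the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
crawler_request = build_crawler_request(
    name="sales-source-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_source",
    s3_path="s3://example-landing-bucket/sales/",
    custom_classifiers=["example-custom-classifier"],
)
```

Calling `create_and_start_crawler(crawler_request)` would create the crawler and run it once. Keeping the request builder separate from the AWS call makes the configuration easy to inspect before anything is created.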
Moving the data
The Glue Job is a PySpark or Python program that can access the source data through the tables in the Glue database. In the background Glue provisions a Spark cluster to perform the processing for data ingestion. Here that means moving the data to a raw data S3 bucket.
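A minimal sketch of such a job is shown below, assuming a PySpark job that reads one catalogued table and writes it unchanged to the raw data bucket. The database, table and bucket names are hypothetical, Parquet is just one possible output format, and the check for `--JOB_NAME` assumes the script is started by Glue with that argument.

```python
import sys

# Hypothetical names, for illustration only.
SOURCE_DATABASE = "sales_source"
SOURCE_TABLE = "orders"
RAW_BUCKET = "example-raw-data-bucket"


def raw_output_path(bucket, table):
    """Return the S3 prefix under which a table's raw copy is written."""
    return f"s3://{bucket}/raw/{table}/"


def main():
    # The awsglue and pyspark libraries are provided by the Glue job runtime,
    # so they are imported only when the job actually runs.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source through the table the Crawler created in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=SOURCE_DATABASE, table_name=SOURCE_TABLE
    )

    # Write the data as-is to the raw data bucket.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": raw_output_path(RAW_BUCKET, SOURCE_TABLE)},
        format="parquet",
    )
    job.commit()


# Run only when started with a --JOB_NAME argument (assumed to come from Glue).
if "--JOB_NAME" in sys.argv:
    main()
```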
Making data available
Once the data has been moved it is crawled again to load its structure into a Glue database, ready for further processing. Whilst Glue has only been discussed here as a data transfer tool, it can also be used for data transformation. This capability is covered in the Data transformation for Machine Learning study guide.
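The second crawl could be driven from Python along these lines. This is only a sketch: it assumes a crawler over the raw data bucket already exists, it polls the crawler state rather than using any event mechanism, and it ignores pagination when listing tables. The `glue` argument is expected to be a boto3 Glue client.

```python
import time


def run_crawler_and_wait(glue, crawler_name, poll_seconds=30, max_polls=60):
    """Start a crawler and poll until it is READY again.

    Returns True if the crawler finished within max_polls checks, else False.
    """
    glue.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # the other states are RUNNING and STOPPING
            return True
        time.sleep(poll_seconds)
    return False


def list_table_names(glue, database):
    """Return the table names in a Glue database (first page of results only)."""
    response = glue.get_tables(DatabaseName=database)
    return [table["Name"] for table in response["TableList"]]
```

With boto3 installed and credentials configured, `glue = boto3.client("glue")` followed by `run_crawler_and_wait(glue, "raw-data-crawler")` would run a crawler with that (hypothetical) name, and `list_table_names(glue, "raw_data")` would then show the tables it produced.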
Scheduling the batch process
Glue Triggers are used to schedule the load process either by a time schedule or by detecting the completion of a previous Glue Job or Glue Crawler. Glue Triggers can also be orchestrated as part of a Glue Workflow.
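The two kinds of trigger described above might be defined as follows. This is a sketch of request bodies for the boto3 Glue `create_trigger` call, with hypothetical trigger and job names. The schedule runs the ingestion job at 02:00 UTC every day, and the conditional trigger starts a follow-on job once the ingestion job has succeeded.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Request body for a trigger that starts a job on a time schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",  # Glue uses six-field cron syntax
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, after_job):
    """Request body for a trigger that starts a job when another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": after_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    """Submit a trigger request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported here so the builders above work without boto3

    boto3.client("glue").create_trigger(**request)


# Hypothetical names, for illustration only.
nightly = scheduled_trigger("nightly-ingest", "ingest-orders-job", "0 2 * * ? *")
follow_on = conditional_trigger(
    "after-ingest", "prepare-orders-job", after_job="ingest-orders-job"
)
```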
AWS Database Migration Service for data ingestion
AWS Database Migration Service (DMS) is designed to transfer data between databases. It supports a long list of data sources, including RDS, S3, and IBM and SAP databases. Since it can also output data to S3, DMS can be used as a data ingestion tool.
The source database can be hosted:
- On RDS
- On an EC2 instance
- On premises
The transfer is by transactions, so it is reliable and you can be confident that all the data has been fully transferred. If there is a failure it will roll back any records in transit.
Database Migration Service can be used for a one-off migration, or it can be configured to move data to a schedule or to replicate continuously, so that any changes in the source data are transferred as they are made.
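In the DMS API the choice between a one-off copy and continuous replication is made through the migration type of a replication task. The sketch below builds the request for the boto3 DMS `create_replication_task` call. It assumes the source endpoint, the S3 target endpoint and the replication instance already exist; the task name, schema name and ARN strings are placeholders.

```python
import json

# Migration types accepted by DMS: a one-off copy, ongoing change capture,
# or a full copy followed by ongoing change capture.
MIGRATION_TYPES = {"full-load", "cdc", "full-load-and-cdc"}


def select_all_tables(schema_name):
    """Table-mapping rule that includes every table in one schema."""
    return {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all-tables",
                "object-locator": {"schema-name": schema_name, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }


def build_replication_task(task_id, source_arn, target_arn, instance_arn,
                           migration_type, schema_name):
    """Build the keyword arguments for the DMS CreateReplicationTask call."""
    if migration_type not in MIGRATION_TYPES:
        raise ValueError(f"unknown migration type: {migration_type}")
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,      # for ingestion, an S3 endpoint
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": migration_type,
        "TableMappings": json.dumps(select_all_tables(schema_name)),
    }


# Placeholder values, for illustration only.
task = build_replication_task(
    task_id="orders-to-s3",
    source_arn="SOURCE_ENDPOINT_ARN",
    target_arn="S3_TARGET_ENDPOINT_ARN",
    instance_arn="REPLICATION_INSTANCE_ARN",
    migration_type="full-load-and-cdc",
    schema_name="sales",
)
```

With boto3 and credentials available, `boto3.client("dms").create_replication_task(**task)` would create the task. Using `"full-load"` instead would give a one-off copy.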
Summary
AWS Glue is a popular choice for ingesting data as a batch process. The Glue Crawler enables data in many different formats to be processed. Processing power is provided by a Spark cluster, and Python or Scala gives programming flexibility. AWS Database Migration Service is ideal for extracting data from a database to load into S3.
Credits
- Photo by Victor Rodríguez Iglesias on Unsplash
- AWS icons: Downloaded from https://aws.amazon.com/architecture/icons/