For Machine Learning, AWS Glue and AWS Database Migration Service (DMS) can both be used to ingest data as a batch process.
This topic is part of sub-domain 1.2, Identify and implement a data-ingestion solution, and focuses on batch processing to ingest data. For ingestion of streaming data see: Streaming data for Machine Learning.
Batch processing using AWS Glue
Batch processing is processing that usually runs to a fixed schedule. Data waits until the batch process starts, and any data that arrives after a run has begun usually has to wait for the next run. In AWS any compute service can be used for batch processing. A common choice for Machine Learning is AWS Glue, which performs ETL as a batch process. If the source data is in a database or another non-S3 data repository, AWS Database Migration Service (DMS) can extract the data so that it can be processed in batches.
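To make the waiting behaviour concrete, here is a minimal sketch in plain Python. It assumes a hypothetical batch job that runs once a day at 02:00; the function names and the schedule are illustrative only and are not part of any AWS API.

```python
from datetime import datetime, timedelta


def next_batch_run(arrival: datetime, run_hour: int = 2) -> datetime:
    """Return the first scheduled run at or after the moment a record arrives."""
    run_today = arrival.replace(hour=run_hour, minute=0, second=0, microsecond=0)
    # Data that arrives after today's run has to wait for tomorrow's run.
    return run_today if arrival <= run_today else run_today + timedelta(days=1)


def waiting_time(arrival: datetime, run_hour: int = 2) -> timedelta:
    """How long a record sits idle before the batch job picks it up."""
    return next_batch_run(arrival, run_hour) - arrival


if __name__ == "__main__":
    for arrival in (datetime(2024, 1, 1, 1, 30), datetime(2024, 1, 1, 2, 15)):
        print(arrival, "waits", waiting_time(arrival))
```

A record that arrives just before the run is processed within minutes, while one that arrives just after the run waits almost a full day. This is the trade-off that separates batch ingestion from streaming ingestion.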
Analysing the data structure
AWS Glue is an ETL service; ETL stands for Extract, Transform and Load. In this case we are concerned with using it as an extraction tool to ingest the data. The process starts with a Glue Crawler, which determines the structure, or schema, of the data to be ingested. This information is used to create tables in a Glue database. The Glue database is part of the Glue Data Catalog, and there is one Data Catalog per AWS account in each region. The Data Catalog is compatible with the Apache Hive metastore. To understand the data structure, Glue Crawlers use Data Classifiers. Each built-in Classifier is specific to a file type or data store, such as a database or a JSON file. There is a list of built-in Data Classifiers here: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
If you have a data structure that is not listed you can create a custom Classifier. During processing the Crawler tries custom Classifiers first and then the built-in Classifiers in order until one is found that can decode the data structure. The result is then used to create the tables in the Glue database.
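The order in which classifiers are tried can be illustrated with a small simulation in plain Python. This is only a sketch of the "custom first, then built-in" rule described above: the three classifier functions are hypothetical stand-ins, and the first-match logic is a simplification (real Glue classifiers report a certainty value rather than a plain yes or no).

```python
import json
from typing import Callable, List, Optional

# A hypothetical classifier returns a format name on success, or None.
Classifier = Callable[[str], Optional[str]]


def pipe_delimited(sample: str) -> Optional[str]:
    """Stand-in for a custom classifier that recognises pipe-separated files."""
    lines = sample.strip().splitlines()
    if lines and all("|" in line for line in lines):
        return "custom-pipe"
    return None


def json_classifier(sample: str) -> Optional[str]:
    """Stand-in for a built-in JSON classifier."""
    try:
        json.loads(sample)
    except ValueError:
        return None
    return "json"


def csv_classifier(sample: str) -> Optional[str]:
    """Stand-in for a built-in CSV classifier."""
    lines = sample.strip().splitlines()
    if lines and all("," in line for line in lines):
        return "csv"
    return None


def classify(sample: str, custom: List[Classifier], built_in: List[Classifier]) -> str:
    """Try custom classifiers first, then built-in ones, and stop at the first match."""
    for classifier in [*custom, *built_in]:
        result = classifier(sample)
        if result is not None:
            return result
    return "UNKNOWN"  # no classifier could decode the data


if __name__ == "__main__":
    built_in = [json_classifier, csv_classifier]
    for sample in ("id|name\n1|x", '{"id": 1}', "id,name\n1,x"):
        print(repr(sample), classify(sample, [pipe_delimited], built_in))
```

Because the custom classifiers are tried first, a custom Classifier can claim a format that a built-in Classifier would otherwise interpret differently.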
Moving the data
The Glue Job is a PySpark or Python program that can access the source data described in the Glue database. In the background Glue provisions a Spark cluster to perform the processing. For data ingestion, the job moves the data to a Raw Data S3 bucket.
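As a rough illustration, a Glue Job can be registered with boto3's `create_job` call. The sketch below only builds the request parameters, so it can be inspected without an AWS account. The job name, role ARN, script location, bucket name, Glue version and worker settings are all assumptions made for the example, as is the `--raw_bucket` job argument.

```python
from typing import Any, Dict


def build_job_definition(
    name: str,
    role_arn: str,
    script_s3_path: str,
    raw_bucket: str,
    workers: int = 2,
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Glue client's create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # "glueetl" selects a Spark job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        # Custom job arguments are passed with a leading "--".
        "DefaultArguments": {"--raw_bucket": raw_bucket},
        "GlueVersion": "4.0",  # assumed version for this sketch
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


def create_job(definition: Dict[str, Any]) -> Dict[str, Any]:
    """Send the definition to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported here so the builder above works without boto3

    return boto3.client("glue").create_job(**definition)


if __name__ == "__main__":
    definition = build_job_definition(
        name="ingest-to-raw",  # hypothetical names throughout
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        script_s3_path="s3://example-scripts/ingest_to_raw.py",
        raw_bucket="example-raw-data",
    )
    print(definition)
```

The number and type of workers determine the size of the Spark cluster that Glue provisions for the job.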
Making data available
Once the data has been moved it is crawled again to load its structure into a Glue database, ready for further processing. Whilst Glue has only been discussed here as a data transfer tool, it can also be used for data transformation. This capability is covered in the Data transformation for Machine Learning study guide.
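Crawling the raw data could be set up along these lines with boto3's `create_crawler` and `start_crawler` calls. This is a sketch under stated assumptions: the crawler name, role ARN, database name and bucket path are hypothetical, and only the parameter builder runs without AWS access.

```python
from typing import Any, Dict, Sequence


def build_crawler_definition(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    classifiers: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    definition: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Glue database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if classifiers:
        # Names of custom classifiers; they are tried before the built-in ones.
        definition["Classifiers"] = list(classifiers)
    return definition


def create_and_start_crawler(definition: Dict[str, Any]) -> None:
    """Create the crawler and run it once (needs boto3 and valid credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])


if __name__ == "__main__":
    print(
        build_crawler_definition(
            name="raw-data-crawler",  # hypothetical names throughout
            role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
            database="raw_data",
            s3_path="s3://example-raw-data/",
        )
    )
```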
Scheduling the batch process
Glue Triggers are used to schedule the load process, either on a time schedule or on the completion of a previous Glue Job or Glue Crawler. Glue Triggers can also be orchestrated as part of a Glue Workflow.
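Both kinds of trigger can be expressed as parameters for boto3's `create_trigger` call. The following sketch assumes hypothetical trigger, job and crawler names and a nightly 02:00 UTC schedule. It builds the request dictionaries only; each would then be passed to `boto3.client("glue").create_trigger(**definition)`.

```python
from typing import Any, Dict


def scheduled_trigger(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Trigger that starts a job on a time schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # for example "cron(0 2 * * ? *)" for 02:00 UTC daily
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def after_job_trigger(name: str, upstream_job: str, crawler_name: str) -> Dict[str, Any]:
    """Trigger that starts a crawler once an upstream job has succeeded."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"CrawlerName": crawler_name}],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    # Hypothetical names: run the ingest job nightly, then crawl the raw bucket.
    print(scheduled_trigger("nightly-ingest", "ingest-to-raw", "cron(0 2 * * ? *)"))
    print(after_job_trigger("crawl-after-ingest", "ingest-to-raw", "raw-data-crawler"))
```

Chaining a scheduled trigger to a conditional trigger in this way gives the move-then-crawl sequence described in the earlier sections.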
AWS Database Migration Service for data ingestion
AWS Database Migration Service (DMS) is designed to transfer data between databases. It supports a long list of data sources, including RDS, S3, and IBM and SAP databases. Since it can also output data to S3, DMS can be used as a data ingestion tool.
The source database can be hosted:
- On an EC2 instance
- On premises
The transfer is by transactions, so it is reliable and you can be confident that all the data has been fully transferred. If there is a failure it will roll back any records in transit.
Database Migration Service can be used for a one-off migration, or it can be configured to move data to a schedule or to replicate continuously, where any changes in the source data are transferred as they are made.
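The choice between a one-off load and continuous replication maps onto the `MigrationType` parameter of boto3's DMS `create_replication_task` call. The sketch below builds those parameters only. The task identifier, ARNs and schema name are hypothetical, and the source endpoint, S3 target endpoint and replication instance are assumed to exist already.

```python
import json
from typing import Any, Dict

# Migration types accepted by DMS: one-off load, ongoing changes only, or both.
MIGRATION_TYPES = {"full-load", "cdc", "full-load-and-cdc"}


def build_table_mappings(schema: str, table: str = "%") -> str:
    """Return a JSON selection rule that includes the matching tables."""
    rules = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-tables",
                "object-locator": {"schema-name": schema, "table-name": table},
                "rule-action": "include",
            }
        ]
    }
    return json.dumps(rules)


def build_replication_task(
    task_id: str,
    source_arn: str,
    target_arn: str,
    instance_arn: str,
    schema: str,
    migration_type: str = "full-load",
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 DMS client's create_replication_task call."""
    if migration_type not in MIGRATION_TYPES:
        raise ValueError(f"unsupported migration type: {migration_type}")
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,  # for ingestion, an S3 target endpoint
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": migration_type,
        "TableMappings": build_table_mappings(schema),
    }


if __name__ == "__main__":
    task = build_replication_task(
        task_id="sales-to-s3",  # hypothetical identifiers throughout
        source_arn="arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",
        target_arn="arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET",
        instance_arn="arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",
        schema="sales",
        migration_type="full-load-and-cdc",
    )
    print(task)
```

With AWS credentials available, the dictionary would be passed to `boto3.client("dms").create_replication_task(**task)`. Using `full-load` gives a one-off migration, while `full-load-and-cdc` copies the existing data and then keeps transferring changes.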
AWS Glue is a popular choice for ingesting data as a batch process. The Glue Crawler enables data in many different formats to be processed. Processing power is provided by a Spark cluster, and Python or Scala give programming flexibility. AWS Database Migration Service is ideal for extracting data from a database to load into S3.
- Photo by Victor Rodríguez Iglesias on Unsplash
- AWS icons: Downloaded from https://aws.amazon.com/architecture/icons/