This is a summary of the AWS Power Machine Learning at Scale White Paper, a 15-page PDF document focusing on High Performance Computing (HPC) on AWS. It can be downloaded from here:
The list of White Papers for Machine Learning is on the Prepare for Your AWS Certification Exam web page:
AWS is very sparse with its description of what is in the exam, so when it identifies resources to use as preparation, that is a big clue to the questions it may ask. This is why you need to read the White Papers. Five papers are listed as suitable preparation for the AWS Machine Learning Specialty certification exam:
- Power Machine Learning at Scale
- Managing Machine Learning Projects
- Machine Learning Foundations: Evolution of ML and AI
- Augmented AI: The Power of Human and Machine
- Machine Learning Lens – AWS Well-Architected Framework
The content of this white paper complements these knowledge domains of the AWS Machine Learning Specialty certification exam:
- Domain 1, Data Engineering
- Sub-domain 1.1 Create data repositories for machine learning
- Sub-domain 1.2 Identify and implement a data-ingestion solution
- Sub-domain 1.3 Identify and implement a data-transformation solution
- Domain 4, Machine Learning Implementation and Operations
Selection criteria for storage
These criteria can be used to select the type of storage needed for Machine Learning. The criteria may differ in each phase of the workflow.
- Data sources
- Frequency of revisions or updates to the data
- Where the data will be stored
- Durability and availability
- Size of the dataset
- Performance requirements for reading
- Performance requirements for writing
- Performance requirements for transferring
The volume of data required for a model
Training a model requires a certain volume of data: the more complex the problem and the algorithm, the more data is needed. Training data volumes can be estimated using statistical methods that may also take into account the number of classes, input features, and model parameters.
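As a concrete illustration, here is a minimal sketch of one such estimate. The "10 samples per input feature" heuristic used below is a common rule of thumb, not a formula from the whitepaper, and the function name and parameters are made up for this example.

```python
def estimate_min_samples(n_features: int, n_classes: int,
                         samples_per_feature: int = 10) -> int:
    """Rough lower bound on training-set size using the common
    '10 samples per input feature' rule of thumb, scaled by the
    number of classes so each class is reasonably represented."""
    return n_features * samples_per_feature * n_classes

# A hypothetical 20-feature, 3-class classification problem:
print(estimate_min_samples(20, 3))  # -> 600
```

Real estimates would also factor in model complexity (e.g. parameter count) and class imbalance; treat a heuristic like this as a starting point, not a guarantee.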
AWS provides ETL services:
- Amazon Athena – for tabular data
- AWS Glue – for non-tabular data
- Amazon Redshift Spectrum – for tabular data
The choice of ETL tool depends on the type of data and on optimising processing. Athena and Redshift Spectrum use standard SQL to transform tabular data, whereas Glue can perform non-relational data manipulations more easily and quickly using built-in Apache Spark clusters.
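To make the "standard SQL on tabular data" idea concrete, the sketch below runs the same kind of aggregation Athena or Redshift Spectrum would, using Python's stdlib `sqlite3` purely as a local stand-in; the table and column names are invented for the example.

```python
import sqlite3

# Athena and Redshift Spectrum apply standard SQL to tabular data;
# sqlite3 is used here only to illustrate that kind of transform locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, page TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("u1", "home"), ("u1", "cart"), ("u2", "home")])

# A typical ETL-style aggregation: event counts per page.
rows = conn.execute(
    "SELECT page, COUNT(*) AS n FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # -> [('cart', 1), ('home', 2)]
```

On AWS the same SQL would run serverlessly over files in S3 (Athena) or external tables (Redshift Spectrum) rather than an in-memory database.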
Data visualization techniques can be used to check the data quality. AWS offers two services:
- Amazon QuickSight
- Amazon SageMaker Notebooks
These visualization tools will help to identify data quality issues and confirm that the training dataset is representative enough to produce well-generalized inferences.
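One simple representativeness check that either tool would visualize is the class distribution of the labels. The helper below is a plain-Python sketch of that check (the function name and sample labels are made up); a heavily skewed distribution is a warning sign before training.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the dataset; a heavily skewed
    distribution suggests the training set is not representative."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(n / total, 3) for cls, n in counts.items()}

# Hypothetical labels: 'cat' dominates, hinting at imbalance.
print(class_balance(["cat"] * 8 + ["dog"] * 2))  # -> {'cat': 0.8, 'dog': 0.2}
```

In a SageMaker notebook you would typically plot these shares as a bar chart; QuickSight can build the same view directly from the data source.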
Data repositories for performance
Because compute services can process data so fast, it is important that data is fed into the process quickly enough that the compute services are not left waiting for it to arrive; this is called a stall. The purpose of the choice of storage is to keep the compute services saturated with data. This is achieved by using high-performance parallelized file systems such as:
- Amazon Elastic File System (EFS)
- Amazon FSx for Lustre
How to select storage
Before moving large training datasets to high-performance storage, check that your use case actually needs it. If fast training speeds are required, there is a choice of storage services depending on the performance required. The White Paper has a table comparing storage service speeds.
- Amazon S3, Amazon Athena, AWS Glue, Amazon Redshift Spectrum, SageMaker notebooks
- Amazon FSx for Lustre, EFS
- Amazon Elastic Inference
- SageMaker Neo
- AWS IoT Greengrass
Distributed Computation Frameworks
Preprocessing data files
Preprocessing data files by splitting them into smaller chunks allows data to be fed in parallel to compute services for efficient processing. This may be essential if the dataset is too big to fit into memory and so cannot be processed in its original form. This processing can be performed by:
- Apache Spark cluster on Amazon EMR
- Amazon SageMaker pipes
- AWS Glue
Running an Apache Spark cluster on Amazon EMR can process massive quantities of data quickly. The processing program can be hosted in the Spark EMR cluster or by connecting a SageMaker Notebook.
CPU and GPU processing
Data preprocessing is performed by a CPU instance (the producer) whilst data processing uses higher-performance GPUs (the consumer). AWS Batch, which can use GPU-enabled instances, can be used to manage this configuration.
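The producer/consumer split can be sketched locally with a bounded queue; here plain threads stand in for the CPU and GPU workers that AWS Batch would run as separate jobs (everything in this snippet is illustrative).

```python
import queue
import threading

q = queue.Queue(maxsize=4)     # bounded queue applies backpressure
results = []

def producer(records):
    for r in records:
        q.put(r * 2)           # stand-in for CPU preprocessing work
    q.put(None)                # sentinel: no more data

def consumer():
    while (item := q.get()) is not None:
        results.append(item)   # stand-in for GPU compute work

t1 = threading.Thread(target=producer, args=(range(5),))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # -> [0, 2, 4, 6, 8]
```

The bounded queue is the key design point: if the GPU consumer falls behind, the CPU producer blocks rather than exhausting memory.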
Data loading can be performed using specialised classes from TensorFlow, PyTorch, or MXNet. These frameworks also provide features to develop a processing pipeline. This allows you to process large data volumes in batches.
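At their core, those loader classes stream fixed-size batches from a data source. The generator below is a minimal pure-Python sketch of that pattern, not the actual TensorFlow or PyTorch API; the real loaders add shuffling, prefetching, and parallel workers on top.

```python
def batched(source, batch_size):
    """Yield fixed-size batches from any iterable data source --
    the core pattern behind tf.data's batching and a PyTorch
    DataLoader, minus shuffling and prefetching."""
    batch = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # final partial batch
        yield batch

print(list(batched(range(7), 3)))  # -> [[0, 1, 2], [3, 4, 5], [6]]
```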
Vertical scaling is recommended before horizontal scaling because it requires a simpler system. So if your processing task is not completing within the timeframe your use case requires, increase the power of the instance and use an instance type more optimised for the type of processing being performed. Once improving these two parameters has no further effect, apply horizontal scaling.
- Apache Spark cluster on Amazon EMR
- Amazon SageMaker notebook
- AWS Batch
- SageMaker pipes and AWS Glue
Build Compute Clusters to Fit the Workload
Diagram of the deep learning CFN cluster
Running Kubernetes on Amazon EKS creates dependency isolation
CloudFormation templates (or other infrastructure-as-code frameworks) enable you to scale and use resources on demand, which helps keep your infrastructure robust.
EC2 Auto Scaling enables the application to scale dynamically
EC2 instance Types
Different EC2 instance types are optimised for different workloads. The P family is highly optimised for Machine Learning and HPC applications, such as deep neural networks, with these features:
- Multiple Graphical Processing Units (GPUs)
- Multiple vCPUs
- Increased network bandwidth
- Local SSD storage
The FPGA (Field-Programmable Gate Array) family of EC2 instances provides specialised hardware acceleration.
The C family of instances has CPUs enhanced by the deep learning functions in the Intel MKL-DNN library.
- Amazon EKS
- Amazon S3, Amazon EC2, Amazon Virtual Private Cloud (Amazon VPC), Amazon EC2 Auto Scaling, Amazon Elastic Container Service for Kubernetes (Amazon EKS), and AWS Identity and Access Management (IAM) services
Modeling Using Hybrid Infrastructures
Hybrid infrastructures use a combination of cloud services together with on-premises or edge computing locations. Some typical use cases are:
- Facilitate legacy IT applications and data migration
- Extend the compute capacity of an on-premises datacenter
- Backup and disaster recovery solution in the cloud
- Training Machine Learning models in the cloud to take advantage of cloud scalability, then running them in production on premises
- High levels of security, for example for national security applications, may mean that only part of the data can be exposed to off-premises processing in the cloud, so the processing is split between on-premises and cloud processes
- Create, train, and optimize your models in the cloud, and then deploy them for inferencing to edge devices
- AWS IoT Greengrass
- AWS Snowball Edge
- AWS Storage Gateway
- AWS CodeBuild, AWS CodePipeline, Amazon CloudWatch, and AWS Lambda
- AWS Step Functions
This was a summary of the AWS White paper Power Machine Learning at Scale.
Whizlab’s AWS Certified Machine Learning Specialty practice exams
Whizlab’s AWS Certified Machine Learning Specialty Practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined in the official documentation. These practice tests help candidates gain more confidence in their exam preparation and self-evaluate against the exam content.
Practice test content
- Free Practice test – 15 questions
- Practice test 1 – 65 questions
- Practice test 2 – 65 questions
- Practice test 3 – 65 questions
Section test content
- Core ML Concepts – 10 questions
- Data Engineering – 11 questions
- Exploratory Data Analysis – 13 questions
- Modeling – 15 questions
- Machine Learning Implementation and Operations – 12 questions