AWS Machine Learning exam guide
What is in the AWS Machine Learning Specialist Certificate exam?
Perhaps "exam syllabus" is an old-fashioned term, which is why AWS do not use it. "Specification" and "blueprint" are more current terms, but AWS don't call it that either. AWS call the description of their exam content the Exam Guide. This is where the exam contents are listed, split into four domains and fifteen sub-domains. This article describes each sub-domain in enough detail for the complete newbie to get a good idea of what it covers. This will give you an overview of what you are getting yourself into.
Domain 1: Data Engineering
1.1 Create data repositories for machine learning.
Identify data sources (e.g., content and location, primary sources such as user data)
Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)
1.2 Identify and implement a data ingestion solution.
Data job styles/types (batch load, streaming)
Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads)
Kinesis; Kinesis Analytics; Kinesis Firehose; EMR; Glue; Job scheduling
1.3 Identify and implement a data transformation solution.
Transforming data in transit (ETL: Glue, EMR, AWS Batch)
Handle ML-specific data using MapReduce (Hadoop, Spark, Hive)
This domain is about getting the data, transforming it and putting it into a repository. It comprises 20% of the exam marks. There are three sub-domains that can be summarised as:
The data repository (sub-domain 1.1) is where you store the raw and processed data. S3 is the repository of choice for Machine Learning in AWS, although some other data stores are also mentioned. The data ingestion sub-domain (1.2) is concerned with getting the raw data into the repository, either via batch processing or streaming. With batch processing, data is collected and grouped at a point in time and passed to the data store. Streaming data is constantly being collected and fed into the data store. The third sub-domain (1.3) focuses on how raw data is transformed into data that can be used for ML processing. The transformation process changes the data structure. The data may also need to be cleaned up: de-duplicated, incomplete records managed, and attributes standardised.
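To make the batch versus streaming distinction concrete, here is a minimal Python sketch using boto3. The bucket and stream names are hypothetical, and the sketch assumes those resources already exist and AWS credentials are configured.

```python
import json

import boto3

# Hypothetical resource names, for illustration only.
BUCKET = "my-ml-raw-data"
STREAM = "my-ml-ingest-stream"

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# Batch ingestion: upload a file of records collected at a point in time.
s3.upload_file("daily_extract.csv", BUCKET, "raw/daily_extract.csv")

# Streaming ingestion: push individual records as they arrive.
record = {"user_id": 42, "event": "click", "ts": "2021-01-01T12:00:00Z"}
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=str(record["user_id"]),
)
```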
Once these data engineering processes are complete, the data is ready for further pre-processing prior to being fed into a Machine Learning algorithm. This pre-processing is covered by the second knowledge domain, Exploratory Data Analysis.
Domain 2: Exploratory Data Analysis
2.1 Sanitize and prepare data for modeling.
Identify and handle missing data, corrupt data, stop words, etc.
Formatting, normalizing, augmenting, and scaling data
Labeled data (recognizing when you have enough labeled data and identifying mitigation strategies [Data labeling tools (Mechanical Turk, manual labor)])
2.2 Perform feature engineering.
Identify and extract features from data sets, including from data sources such as text, speech, image, public datasets, etc.
Analyze/evaluate feature engineering concepts (binning, tokenization, outliers, synthetic features, one-hot encoding, reducing dimensionality of data)
2.3 Analyze and visualize data for machine learning.
Graphing (scatter plot, time series, histogram, box plot)
Interpreting descriptive statistics (correlation, summary statistics, p value)
Clustering (hierarchical, diagnosing, elbow plot, cluster size)
In this domain the data is analysed so it can be understood and cleaned up. It comprises 24% of the exam marks. There are three sub-domains:
Analyzing and visualizing the data (sub-domain 2.3) overlaps with the other two sub-domains, which both use its techniques: graphs, charts and matrices. Before you can sanitize and prepare data (sub-domain 2.1) you have to understand it. This is done using statistics that focus on specific aspects of the data, together with graphs and charts that reveal relationships and distributions. The data can then be cleaned up using techniques that remove distortions and fill in gaps. Feature engineering (sub-domain 2.2) is about creating new features from existing ones to make the ML algorithms more powerful. Techniques are used to reduce the number of features and categorise the data.
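As a minimal illustration of these steps (with made-up data and arbitrary choices), a pandas sketch might look like this:

```python
import pandas as pd

# A tiny illustrative data set with a missing value and a categorical column.
df = pd.DataFrame({
    "age": [25, 38, None, 52],
    "income": [30_000, 62_000, 45_000, 90_000],
    "segment": ["a", "b", "a", "c"],
})

# Understand the data: summary statistics and correlations.
print(df.describe())
print(df[["age", "income"]].corr())

# Sanitize: fill the missing age with the median (one simple strategy).
df["age"] = df["age"].fillna(df["age"].median())

# Feature engineering: bin a numeric column and one-hot encode a
# categorical one.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])
df = pd.get_dummies(df, columns=["segment"])
print(df)
```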
You now understand your data and have cleaned it up, ready for the next stage: modeling.
Domain 3: Modeling
3.1 Frame business problems as machine learning problems.
Determine when to use/when not to use ML
Know the difference between supervised and unsupervised learning
Selecting from among classification, regression, forecasting, clustering, recommendation, etc.
3.2 Select the appropriate model(s) for a given machine learning problem.
XGBoost, logistic regression, K-means, linear regression, decision trees, random forests, RNN, CNN, Ensemble, Transfer learning
Express intuition behind models
3.3 Train machine learning models.
Train validation test split, cross-validation
Optimizer, gradient descent, loss functions, local minima, convergence, batches, probability
Compute choice (GPU vs. CPU, distributed vs. non-distributed, platform [Spark vs. non-Spark])
Model updates and retraining
Batch vs. real-time/online
3.4 Perform hyperparameter optimization.
Regularization; Dropout; L1/L2; Cross validation
Model initialization
Neural network architecture (layers/nodes), learning rate, activation functions
Tree-based models (# of trees, # of levels)
Linear models (learning rate)
3.5 Evaluate machine learning models.
Avoid overfitting/underfitting (detect and handle bias and variance)
Metrics (AUC-ROC, accuracy, precision, recall, RMSE, F1 score)
Confusion matrix
Offline and online model evaluation, A/B testing
Compare models using metrics (time to train a model, quality of model, engineering costs)
Cross validation
When people talk about Machine Learning they are mostly thinking about Modeling. Modeling is selecting and testing the algorithms that process the data to find the information of value. It comprises 36% of the exam marks. This domain has five sub-domains:
- 3.1 Frame the business problem
- 3.2 Select the appropriate models
- 3.3 Train the models
- 3.4 Tune the models
- 3.5 Evaluate the models
Firstly (sub-domain 3.1), decide whether ML is appropriate for the problem. ML is good for data-driven problems involving large amounts of data where the rules cannot easily be coded. The business problem can probably be framed in many ways, and the framing determines what kind of ML problem is being solved. For example, the business problem could be framed to require a yes/no answer, as in fraud detection, or a numeric value, as in share price prediction.
Many models (sub-domain 3.2) are available through AWS Machine Learning services. Each model has its own use cases and requirements. Once a model has been chosen, an iterative process of training, tuning and evaluation begins.
Model training (sub-domain 3.3) is the process of providing a model with data to learn from. During model training the data is split into three parts: most (70% to 80%) is used as training data, with the remainder used for validation and testing.
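A minimal scikit-learn sketch of such a split, using a 70/15/15 ratio (one common choice, not a rule):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve off 70% for training ...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, random_state=0
)
# ... then split the remainder evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)
print(len(X_train), len(X_val), len(X_test))  # roughly 70/15/15
```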
Model tuning (sub-domain 3.4) is also known as hyperparameter optimisation. Hyperparameters are settings that are fixed before training starts and do not change during it. They can be tuned manually, with search methods, or automatically using SageMaker's guided search. Model tuning also includes additional feature engineering and experimenting with new algorithms.
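SageMaker's automatic model tuning performs this search at scale, but the underlying idea can be sketched locally with scikit-learn's grid search over L1/L2 regularization settings; the grid values here are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Search over regularization type (L1/L2) and strength, scored by
# 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```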
Model evaluation (sub-domain 3.5) is used to find out how well a model will do in predicting the desired outcome. This is done using metrics to measure the performance of the model. Metrics measure accuracy, precision and other qualities of the model by comparing its predictions with the known labels in held-out test data.
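A minimal sketch of these metrics with scikit-learn, evaluating against a held-out test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions against the held-out labels.
print(confusion_matrix(y_test, y_pred))
print("precision", precision_score(y_test, y_pred))
print("recall   ", recall_score(y_test, y_pred))
print("F1       ", f1_score(y_test, y_pred))
print("AUC-ROC  ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```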
Your model is now ready to be used with real data. But before it can be let loose on your corporate data it has to be deployed into the production environment.
Domain 4: Machine Learning Implementation and Operations
4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
AWS environment logging and monitoring: CloudTrail and CloudWatch; Build error monitoring
Multiple regions, Multiple AZs
AMI/golden image
Docker containers
Auto Scaling groups
Rightsizing: Instances; Provisioned IOPS; Volumes
Load balancing
AWS best practices
4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
ML on AWS (application services): Polly; Lex; Transcribe
AWS service limits
Build your own model vs. SageMaker built-in algorithms
Infrastructure: (spot, instance types), cost considerations; Using spot instances to train deep learning models using AWS Batch
4.3 Apply basic AWS security practices to machine learning solutions.
IAM
S3 bucket policies
Security groups
VPC
Encryption/anonymization
4.4 Deploy and operationalize machine learning solutions.
Exposing endpoints and interacting with them
ML model versioning
A/B testing
Retrain pipelines
ML debugging/troubleshooting: Detect and mitigate drop in performance; Monitor performance of the model
This domain is about productionisation and the related DevOps skills needed to make everything work in production. It comprises 20% of the exam marks. There are four sub-domains:
- 4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
- 4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
- 4.3 Apply basic AWS security practices to machine learning solutions.
- 4.4 Deploy and operationalize machine learning solutions.
Building highly available, fault-tolerant systems relies on separating the components of a system into a loosely coupled, distributed system. This ensures that a failure in one part of the system is less able to affect other parts. AWS services and features that enable decoupling are SQS, CloudWatch, CloudTrail and SageMaker endpoints.
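As a minimal sketch of queue-based decoupling with boto3 (the queue name and message content are hypothetical):

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue decoupling a data producer from an ML consumer.
queue_url = sqs.create_queue(QueueName="ml-inference-requests")["QueueUrl"]

# Producer side: enqueue work without knowing who will process it.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"record_id": 123}')

# Consumer side: poll, process, then delete the message.
resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=5)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```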
Scalability is the property of a system to automatically provision more resources when needed and to scale those resources back when demand is low, reducing waste. AWS services and features that enable scalability are Auto Scaling and containerised ML models packaged as Docker images.
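As an illustration, a deployed SageMaker endpoint can be scaled automatically by registering its variant with Application Auto Scaling; the endpoint name, capacity limits and target value below are hypothetical:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names.
resource_id = "endpoint/my-ml-endpoint/variant/AllTraffic"

# Register the endpoint variant so its instance count can scale
# between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy: add instances when invocations per instance
# exceed the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```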
Conclusion
The AWS Certified Machine Learning – Speciality Exam Guide is good for outlining the breadth of the exam and how it is divided into four domains and fifteen sub-domains. Whilst it lists and mentions many subjects, only a few are described in any detail, and it is still a bit light with those. I suggest this is the first thing you should study when you start preparing for the exam.