A photograph showing gingerbread men being cut out with a cookie cutter to symbolize selecting a SageMaker built-in algorithm for an appropriate problem

How to select a model for a given machine learning problem

To select a model for a given Machine Learning problem we use the information and conclusions from Framing the Problem. A Machine Learning problem can be described with four aspects:

  1. Data types and format
  2. Learning paradigm or domain
  3. Problem type
  4. Use case examples

The first aspect concerns the format and structure of the data, which could be numeric, images or text. Numeric data is often tabular. The second aspect is the learning paradigm or domain which includes supervised learning, unsupervised learning, textual analysis and image processing. The third aspect is about the type of problem, for example classification, clustering, or topic modeling. The final aspect is use cases and AWS provides sixteen use case examples which will apply to many Machine Learning problems.

This information allows us to narrow down the choices of algorithms, sometimes to a single algorithm. However there may be other factors that influence the choice of algorithm. For example, some algorithms do not perform well with sparse data. These factors and nuances are discussed in the individual algorithm pages.

These revision notes are part of subdomain 3.2 Select the appropriate model(s) for a given machine learning problem of the exam syllabus.

Questions

To confirm your understanding scroll to the bottom of the page for questions and answers.

More questions for SageMaker built-in algorithms and their uses are in this article: 35 Q & A for SageMaker built-in algorithms

Video: Built-in Machine Learning Algorithms with Amazon SageMaker – a Deep Dive

A 15.37-minute video by Emily Webber from AWS.

What are the built-in algorithms

SageMaker has seventeen built-in algorithms to choose from when selecting a model. These are optimised versions of common open source algorithms. Here they are listed in alphabetical order:

  1. BlazingText algorithm
  2. DeepAR Forecasting Algorithm
  3. Factorization Machines Algorithm
  4. Image Classification Algorithm
  5. IP Insights
  6. K-Means Algorithm
  7. K-Nearest Neighbors (K-NN) Algorithm
  8. Latent Dirichlet Allocation (LDA) Algorithm
  9. Linear Learner Algorithm
  10. Neural Topic Model (NTM) Algorithm
  11. Object Detection Algorithm
  12. Object2Vec Algorithm
  13. Principal Component Analysis (PCA) Algorithm
  14. Random Cut Forest (RCF) Algorithm
  15. Semantic Segmentation Algorithm
  16. Sequence-to-Sequence Algorithm
  17. XGBoost Algorithm

Video: AWS re:Invent 2020: Choose the right machine learning algorithm in Amazon SageMaker

This is a 29.53-minute video from AWS by Denis Batalov and Alberto Danese. The timestamps are:

  • 0 – introduction
  • 1.30 – 17 built-in algorithms in Amazon SageMaker
  • 5.13 – Image classification demo
  • 10.35 – Guide for classification / regression algorithm
  • 11.30 – Amazon blog post on Linear Learner
  • 11.46 – K-Nearest Neighbor (K-NN)
  • 13.22 – Amazon blog post on K-NN
  • 13.50 – XGBoost, how it works
  • 16.00 – Getting a grasp on how XGBoost works
  • 24.00 – XGBoost as a built-in algorithm
  • 25.18 – XGBoost in Nexi
  • 28.19 – Popular frameworks
  • 28.47 – AWS Marketplace
  • 29.20 – Amazon resources for SageMaker built-in algorithms

What is the definition of a model and an algorithm

When you take one of the SageMaker built-in algorithms and train it with data you create a model, therefore:

Model = Training (an Algorithm + Data)
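The relationship above can be illustrated with a minimal sketch in plain Python. This is not SageMaker code: `linear_algorithm`, `train` and the sample data are hypothetical names and values chosen for illustration. The point is only that the algorithm is a procedure, and the model is what that procedure returns once it has seen data.

```python
from statistics import mean


def linear_algorithm(data):
    """A toy 'algorithm': least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar

    # The returned function is the trained model: parameters fixed by the data.
    def model(x):
        return slope * x + intercept

    return model


def train(algorithm, data):
    """Model = Training (an Algorithm + Data)."""
    return algorithm(data)


# Hypothetical training data that follows y = 2x + 1 exactly.
training_data = [(1, 3), (2, 5), (3, 7), (4, 9)]
model = train(linear_algorithm, training_data)
print(model(10))  # prediction for an unseen input
```

Training the same algorithm on different data would give a different model, which is why the two terms are kept distinct.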

The four aspects of a problem used for model selection

  1. Data types and format
  2. Learning paradigm or domain
  3. Problem type
  4. Use case examples

Data types and format

SageMaker algorithms have very specific requirements for the data you train them with, so we can select a model based on the data type and format the algorithm processes. This aspect allows the SageMaker built-in algorithms to be split into three groups: Tabular, Text and Image.

| SageMaker algorithm | Data types and format |
| --- | --- |
| DeepAR Forecasting Algorithm | Tabular |
| Factorization Machines Algorithm | Tabular |
| IP Insights | Tabular |
| K-Means Algorithm | Tabular |
| K-Nearest Neighbors (K-NN) Algorithm | Tabular |
| Linear Learner Algorithm | Tabular |
| Object2Vec Algorithm | Tabular |
| Principal Component Analysis (PCA) Algorithm | Tabular |
| Random Cut Forest (RCF) Algorithm | Tabular |
| XGBoost Algorithm | Tabular |
| BlazingText algorithm | Text |
| Latent Dirichlet Allocation (LDA) Algorithm | Text |
| Neural Topic Model (NTM) Algorithm | Text |
| Sequence-to-Sequence Algorithm | Text |
| Image Classification Algorithm | Image |
| Object Detection Algorithm | Image |
| Semantic Segmentation Algorithm | Image |

Learning paradigm or domain

The learning paradigm or domain includes:

  1. Supervised learning
  2. Unsupervised learning
  3. Textual analysis
  4. Image processing

The input data domain can be used to select a model by identifying a subset of the algorithms. If the input data domain is text or images, the choice is confined to four or three algorithms respectively. The Learning Paradigm also narrows the search to smaller groups of algorithms. The key factor here is whether the data is labeled, for Supervised Learning, or unlabeled, for Unsupervised Learning.

| SageMaker algorithm | Learning paradigm or domain |
| --- | --- |
| DeepAR Forecasting Algorithm | Supervised Learning |
| Factorization Machines Algorithm | Supervised Learning |
| K-Nearest Neighbors (K-NN) Algorithm | Supervised Learning |
| Linear Learner Algorithm | Supervised Learning |
| XGBoost Algorithm | Supervised Learning |
| IP Insights | Unsupervised Learning |
| K-Means Algorithm | Unsupervised Learning |
| Object2Vec Algorithm | Unsupervised Learning |
| Principal Component Analysis (PCA) Algorithm | Unsupervised Learning |
| Random Cut Forest (RCF) Algorithm | Unsupervised Learning |
| BlazingText algorithm | Textual Analysis |
| Latent Dirichlet Allocation (LDA) Algorithm | Textual Analysis |
| Neural Topic Model (NTM) Algorithm | Textual Analysis |
| Sequence-to-Sequence Algorithm | Textual Analysis |
| Image Classification Algorithm | Image Processing |
| Object Detection Algorithm | Image Processing |
| Semantic Segmentation Algorithm | Image Processing |

Problem type

The Problem Type describes the kind of task to be performed on the data. This aspect includes:

  • Classification
  • Regression
  • Time-series forecasting
  • Clustering
  • Topic modeling
  • Dimensionality reduction
  • Anomaly detection
  • IP anomaly detection
  • Embeddings
  • Text classification
  • Machine translation
  • Text summarization
  • Speech-to-text
  • Image and multi-label classification
  • Object detection and classification
  • Computer vision

| SageMaker Algorithm | Problem type |
| --- | --- |
| BlazingText algorithm | Text classification and embedding |
| DeepAR Forecasting Algorithm | Time-series forecasting |
| Factorization Machines Algorithm | Binary/multi-class classification, Regression |
| Image Classification Algorithm | Image and multi-label classification |
| IP Insights | IP anomaly detection |
| K-Means Algorithm | Clustering or grouping |
| K-Nearest Neighbors (K-NN) Algorithm | Binary/multi-class classification, Regression |
| Latent Dirichlet Allocation (LDA) Algorithm | Topic modeling |
| Linear Learner Algorithm | Binary/multi-class classification, Regression |
| Neural Topic Model (NTM) Algorithm | Topic modeling |
| Object Detection Algorithm | Object detection and classification |
| Object2Vec Algorithm | Embeddings |
| Principal Component Analysis (PCA) Algorithm | Feature engineering: dimensionality reduction |
| Random Cut Forest (RCF) Algorithm | Anomaly detection |
| Semantic Segmentation Algorithm | Computer vision |
| Sequence-to-Sequence Algorithm | Machine translation |
| XGBoost Algorithm | Binary/multi-class classification, Regression |
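In practice you start from the problem type and want the candidate algorithms, which is the reverse of how the table above is laid out. The following Python sketch inverts the mapping. The dictionary is transcribed from the table, with two simplifications that are assumptions of this sketch: "Binary/multi-class classification" is shortened to "Classification", and "Text classification and embedding" is split into two entries.

```python
from collections import defaultdict

# Algorithm -> problem types, transcribed from the table above.
PROBLEM_TYPES_BY_ALGORITHM = {
    "BlazingText": ["Text classification", "Embeddings"],
    "DeepAR Forecasting": ["Time-series forecasting"],
    "Factorization Machines": ["Classification", "Regression"],
    "Image Classification": ["Image and multi-label classification"],
    "IP Insights": ["IP anomaly detection"],
    "K-Means": ["Clustering"],
    "K-Nearest Neighbors (K-NN)": ["Classification", "Regression"],
    "Latent Dirichlet Allocation (LDA)": ["Topic modeling"],
    "Linear Learner": ["Classification", "Regression"],
    "Neural Topic Model (NTM)": ["Topic modeling"],
    "Object Detection": ["Object detection and classification"],
    "Object2Vec": ["Embeddings"],
    "Principal Component Analysis (PCA)": ["Dimensionality reduction"],
    "Random Cut Forest (RCF)": ["Anomaly detection"],
    "Semantic Segmentation": ["Computer vision"],
    "Sequence-to-Sequence": ["Machine translation"],
    "XGBoost": ["Classification", "Regression"],
}


def invert(mapping):
    """Build problem type -> list of algorithms from algorithm -> problem types."""
    inverted = defaultdict(list)
    for algorithm, problem_types in mapping.items():
        for problem_type in problem_types:
            inverted[problem_type].append(algorithm)
    return dict(inverted)


ALGORITHMS_BY_PROBLEM_TYPE = invert(PROBLEM_TYPES_BY_ALGORITHM)

print(ALGORITHMS_BY_PROBLEM_TYPE["Regression"])
print(ALGORITHMS_BY_PROBLEM_TYPE["Topic modeling"])
print(ALGORITHMS_BY_PROBLEM_TYPE["Anomaly detection"])
```

Some problem types, such as time-series forecasting, lead to a single algorithm, while classification and regression still leave four candidates to compare.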

Use case examples

| SageMaker Algorithm | Use case |
| --- | --- |
| BlazingText algorithm | Assign predefined categories to documents in a corpus of text |
| DeepAR Forecasting Algorithm | Based on historical data for a behavior, predict future behavior |
| Factorization Machines Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |
| Image Classification Algorithm | Label/tag an image based on the content of the image |
| IP Insights | Protect your application from suspicious users |
| K-Means Algorithm | Group similar objects/data together |
| K-Nearest Neighbors (K-NN) Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |
| Latent Dirichlet Allocation (LDA) Algorithm | Organize a set of documents into topics (not known in advance) |
| Linear Learner Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |
| Neural Topic Model (NTM) Algorithm | Organize a set of documents into topics (not known in advance) |
| Object Detection Algorithm | Detect people and objects in an image |
| Object2Vec Algorithm | Improve the data embeddings of the high-dimensional objects |
| Principal Component Analysis (PCA) Algorithm | Drop those columns from a dataset that have a weak relation with the label/target variable. This reduces the number of features to be analyzed. |
| Random Cut Forest (RCF) Algorithm | Detect abnormal behavior in application |
| Semantic Segmentation Algorithm | Tag every pixel of an image individually with a category |
| Sequence-to-Sequence Algorithm | Convert audio files to text; Summarize a long text corpus; Convert text from one language to another |
| XGBoost Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |

Classifying algorithms with Learning Paradigm and Data Type

The first two problem aspects we discussed have the fewest options: four for learning paradigms and three for data types. These aspects are also the easiest to identify in Problem Framing since they are based on easily observable characteristics of the data. The table below shows that identifying whether the data requires Supervised or Unsupervised learning significantly reduces the number of suitable algorithms. However, you will still have five algorithms to choose from in each group.

Rows are data types and format; columns are learning paradigm or domain.

| Data types and format | Supervised | Unsupervised | Text | Image |
| --- | --- | --- | --- | --- |
| Tabular | DeepAR Forecasting, Factorization Machines, K-Nearest Neighbors, Linear Learner, XGBoost | IP Insights, K-Means, PCA, Random Cut Forest, Object2Vec | | |
| Text | | | BlazingText, LDA, NTM, Sequence-to-Sequence | |
| Image | | | | Image Classification, Object Detection, Semantic Segmentation |
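The narrowing effect of these two aspects can be checked with a short filter. This is a sketch in plain Python, not part of any SageMaker API: `CATALOGUE` and `narrow` are hypothetical names, and the entries follow the groupings used in these notes, with BlazingText placed under text data as in the earlier tables.

```python
# (algorithm, data type, learning paradigm or domain), as grouped in these notes.
CATALOGUE = [
    ("DeepAR Forecasting", "Tabular", "Supervised"),
    ("Factorization Machines", "Tabular", "Supervised"),
    ("K-Nearest Neighbors", "Tabular", "Supervised"),
    ("Linear Learner", "Tabular", "Supervised"),
    ("XGBoost", "Tabular", "Supervised"),
    ("IP Insights", "Tabular", "Unsupervised"),
    ("K-Means", "Tabular", "Unsupervised"),
    ("PCA", "Tabular", "Unsupervised"),
    ("Random Cut Forest", "Tabular", "Unsupervised"),
    ("Object2Vec", "Tabular", "Unsupervised"),
    ("BlazingText", "Text", "Text"),
    ("LDA", "Text", "Text"),
    ("NTM", "Text", "Text"),
    ("Sequence-to-Sequence", "Text", "Text"),
    ("Image Classification", "Image", "Image"),
    ("Object Detection", "Image", "Image"),
    ("Semantic Segmentation", "Image", "Image"),
]


def narrow(data_type=None, paradigm=None):
    """Return the algorithms matching every aspect that was specified."""
    return [
        name
        for name, dtype, para in CATALOGUE
        if (data_type is None or dtype == data_type)
        and (paradigm is None or para == paradigm)
    ]


print(len(narrow()))                       # all built-in algorithms
print(len(narrow(data_type="Tabular")))    # after identifying the data type
print(narrow(data_type="Tabular", paradigm="Supervised"))
```

Knowing only that the data is tabular leaves ten candidates. Adding the learning paradigm halves that to five, and the problem type is then needed to narrow the choice further.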

Summary

SageMaker has seventeen built-in algorithms that can be used to build Machine Learning models. Four aspects can be used to select a model: Data types and format; Learning paradigm or domain; Problem type; Use case examples. Using these aspects to select appropriate algorithms reduces the choice to a small group, and often to a single algorithm.

Credits

Photo by Dari lli on Unsplash




10 questions and answers

By Michael Stainsbury

3.2 How to select a model for a given machine learning problem (Silver)

These 10 test questions are for subdomain 3.2 Select the appropriate model(s) for a given machine learning problem of the Modeling knowledge domain.

1 / 10

Which of these SageMaker built-in algorithms process tabular data?

3 / 10

Which SageMaker built-in algorithms can be used for Anomaly Detection?

4 / 10

Which SageMaker built-in algorithms process text?

5 / 10

Which learning paradigms or domains can be used to identify a group of SageMaker built-in algorithms?

6 / 10

Which aspects can you use to choose a SageMaker built-in algorithm?

8 / 10

Which data types and formats can be used to identify a group of SageMaker built-in algorithms?

9 / 10

Which one of these SageMaker built-in algorithms processes image data?

10 / 10

The Image processing algorithms are:

  1. Image Classification
  2. Object Detection
  3. <–?–>


