# How to select a model for a given machine learning problem

To select a model for a given Machine Learning problem, we use the information and conclusions from Framing the Problem. A Machine Learning problem can be described by four aspects:

- Data types and format
- Learning paradigm or domain
- Problem type
- Use case examples

The first aspect concerns the format and structure of the data, which could be numeric, images or text. Numeric data is often tabular. The second aspect is the learning paradigm or domain, which includes supervised learning, unsupervised learning, textual analysis and image processing. The third aspect is the type of problem, for example classification, clustering or topic modeling. The final aspect is the use case; AWS provides sixteen use case examples, which apply to many Machine Learning problems.

This information allows us to narrow down the choice of algorithms, sometimes to a single algorithm. However, other factors may also influence the choice. For example, some algorithms do not perform well with sparse data. These factors and nuances are discussed in the individual algorithm pages.

These revision notes are part of subdomain 3.2 *Select the appropriate model(s) for a given machine learning problem* of the exam syllabus.

# Questions

To confirm your understanding **scroll to the bottom of the page for questions and answers.**

More questions for SageMaker built-in algorithms and their uses are in this article: 35 Q & A for SageMaker built-in algorithms

- AWS: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
- AWS choose algorithm: https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html
- https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html
- https://towardsdatascience.com/do-you-know-how-to-choose-the-right-machine-learning-algorithm-among-7-different-types-295d0b0c7f60

#### Video: Built-in Machine Learning Algorithms with Amazon SageMaker – a Deep Dive

A 15.37-minute video by Emily Webber from AWS.

# What are the built-in algorithms

SageMaker has seventeen built-in algorithms to select a model from. These are optimised versions of common open source algorithms. Here they are listed in alphabetical order:

- BlazingText algorithm
- DeepAR Forecasting Algorithm
- Factorization Machines Algorithm
- Image Classification Algorithm
- IP Insights
- K-Means Algorithm
- K-Nearest Neighbors (K-NN) Algorithm
- Latent Dirichlet Allocation (LDA) Algorithm
- Linear Learner Algorithm
- Neural Topic Model (NTM) Algorithm
- Object Detection Algorithm
- Object2Vec Algorithm
- Principal Component Analysis (PCA) Algorithm
- Random Cut Forest (RCF) Algorithm
- Semantic Segmentation Algorithm
- Sequence-to-Sequence Algorithm
- XGBoost Algorithm

#### Video: AWS re:Invent 2020: Choose the right machine learning algorithm in Amazon SageMaker

This is a 29.53-minute video from AWS by Denis Batalov and Alberto Danese. The timestamps are:

- 0 – introduction
- 1.30 – 17 built-in algorithms in Amazon SageMaker
- 5.13 – Image classification demo
- 10.35 – Guide for classification / regression algorithm
- 11.30 – Amazon blog post on Linear Learner
- 11.46 – K-Nearest Neighbor (K-NN)
- 13.22 – Amazon blog post on K-NN
- 13.50 – XGBoost, how it works
- 16.00 – Getting a grasp on how XGBoost works
- 24.00 – XGBoost as a built-in algorithm
- 25.18 – XGBoost in Nexi
- 28.19 – Popular frameworks
- 28.47 – AWS Marketplace
- 29.20 – Amazon resources for SageMaker built-in algorithms

#### What is the definition of a model and an algorithm

When you take one of the SageMaker built-in algorithms and train it with data, you create a model. Therefore:

**Model = Training (an Algorithm + Data)**
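The relationship can be illustrated with a minimal sketch in plain Python. This is not a SageMaker API: the function `least_squares_algorithm` and the toy data are hypothetical, chosen only to show that an algorithm is a procedure, and the model is what that procedure returns once it has been run on data.

```python
from typing import Callable, Sequence

# In this sketch a "model" is simply a function that maps an input to a prediction.
Model = Callable[[float], float]


def least_squares_algorithm(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Hypothetical algorithm: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x

    def model(x: float) -> float:
        # The learned parameters are captured inside the returned model.
        return slope * x + intercept

    return model


# Training = running the algorithm on data; the result is the model.
training_x = [1.0, 2.0, 3.0, 4.0]
training_y = [3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1
model = least_squares_algorithm(training_x, training_y)
prediction = model(5.0)
print(prediction)
```

The same algorithm trained on different data would return a different model, which is why the two terms are kept separate.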

# The four aspects of a problem used for model selection

- Data types and format
- Learning paradigm or domain
- Problem type
- Use case examples

## Data types and format

SageMaker algorithms have very specific requirements for the data you train them with, so we can select a model based on the type and format of the data the algorithm processes. This aspect splits the SageMaker built-in algorithms into three groups: Tabular, Text and Image.

| SageMaker algorithm | Data types and format |
| --- | --- |
| DeepAR Forecasting Algorithm | Tabular |
| Factorization Machines Algorithm | Tabular |
| IP Insights | Tabular |
| K-Means Algorithm | Tabular |
| K-Nearest Neighbors (K-NN) Algorithm | Tabular |
| Linear Learner Algorithm | Tabular |
| Object2Vec Algorithm | Tabular |
| Principal Component Analysis (PCA) Algorithm | Tabular |
| Random Cut Forest (RCF) Algorithm | Tabular |
| XGBoost Algorithm | Tabular |
| BlazingText algorithm | Text |
| Latent Dirichlet Allocation (LDA) Algorithm | Text |
| Neural Topic Model (NTM) Algorithm | Text |
| Sequence-to-Sequence Algorithm | Text |
| Image Classification Algorithm | Image |
| Object Detection Algorithm | Image |
| Semantic Segmentation Algorithm | Image |

## Learning paradigm or domain

The learning paradigm or domain includes:

- Supervised learning
- Unsupervised learning
- Textual analysis
- Image processing

The input data domain can be used to select a model by identifying a subset of the algorithms. If the input data domain is text or images, the choice is confined to four and three algorithms respectively. The learning paradigm also narrows the search to smaller groups of algorithms. The key factor here is whether the data is labelled, for Supervised Learning, or unlabelled, for Unsupervised Learning.

| SageMaker algorithm | Learning paradigm or domain |
| --- | --- |
| DeepAR Forecasting Algorithm | Supervised Learning |
| Factorization Machines Algorithm | Supervised Learning |
| K-Nearest Neighbors (K-NN) Algorithm | Supervised Learning |
| Linear Learner Algorithm | Supervised Learning |
| XGBoost Algorithm | Supervised Learning |
| IP Insights | Unsupervised Learning |
| K-Means Algorithm | Unsupervised Learning |
| Object2Vec Algorithm | Unsupervised Learning |
| Principal Component Analysis (PCA) Algorithm | Unsupervised Learning |
| Random Cut Forest (RCF) Algorithm | Unsupervised Learning |
| BlazingText algorithm | Textual Analysis |
| Latent Dirichlet Allocation (LDA) Algorithm | Textual Analysis |
| Neural Topic Model (NTM) Algorithm | Textual Analysis |
| Sequence-to-Sequence Algorithm | Textual Analysis |
| Image Classification Algorithm | Image Processing |
| Object Detection Algorithm | Image Processing |
| Semantic Segmentation Algorithm | Image Processing |

## Problem type

The problem type describes the task to be performed on the data. This aspect includes:

- Classification
- Regression
- Time-series forecasting
- Clustering
- Topic modeling
- Dimensionality reduction
- Anomaly detection
- IP anomaly detection
- Embeddings
- Text classification
- Machine translation
- Text summarization
- Speech-to-text
- Image and multi-label classification
- Object detection and classification
- Computer vision

| SageMaker Algorithm | Problem type |
| --- | --- |
| BlazingText algorithm | Text classification and embedding |
| DeepAR Forecasting Algorithm | Time-series forecasting |
| Factorization Machines Algorithm | Binary/multi-class classification, Regression |
| Image Classification Algorithm | Image and multi-label classification |
| IP Insights | IP anomaly detection |
| K-Means Algorithm | Clustering or grouping |
| K-Nearest Neighbors (K-NN) Algorithm | Binary/multi-class classification, Regression |
| Latent Dirichlet Allocation (LDA) Algorithm | Topic modeling |
| Linear Learner Algorithm | Binary/multi-class classification, Regression |
| Neural Topic Model (NTM) Algorithm | Topic modeling |
| Object Detection Algorithm | Object detection and classification |
| Object2Vec Algorithm | Embeddings |
| Principal Component Analysis (PCA) Algorithm | Feature engineering: dimensionality reduction |
| Random Cut Forest (RCF) Algorithm | Anomaly detection |
| Semantic Segmentation Algorithm | Computer vision |
| Sequence-to-Sequence Algorithm | Machine translation |
| XGBoost Algorithm | Binary/multi-class classification, Regression |
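As a rough sketch of how the learning paradigm and the problem type combine to narrow the choice, the snippet below filters a small catalogue. It is not part of any AWS SDK: the `Algorithm` tuple, the `CATALOGUE` list and the `candidates` function are hypothetical names, the catalogue covers only the supervised and unsupervised rows of the tables in this article, and "Classification" stands in for binary/multi-class classification.

```python
from typing import List, NamedTuple, Tuple


class Algorithm(NamedTuple):
    name: str
    paradigm: str
    problem_types: Tuple[str, ...]


# Hypothetical catalogue, transcribed from the tables in this article.
CATALOGUE = [
    Algorithm("DeepAR Forecasting", "Supervised Learning", ("Time-series forecasting",)),
    Algorithm("Factorization Machines", "Supervised Learning", ("Classification", "Regression")),
    Algorithm("K-Nearest Neighbors (K-NN)", "Supervised Learning", ("Classification", "Regression")),
    Algorithm("Linear Learner", "Supervised Learning", ("Classification", "Regression")),
    Algorithm("XGBoost", "Supervised Learning", ("Classification", "Regression")),
    Algorithm("IP Insights", "Unsupervised Learning", ("IP anomaly detection",)),
    Algorithm("K-Means", "Unsupervised Learning", ("Clustering",)),
    Algorithm("Object2Vec", "Unsupervised Learning", ("Embeddings",)),
    Algorithm("Principal Component Analysis (PCA)", "Unsupervised Learning", ("Dimensionality reduction",)),
    Algorithm("Random Cut Forest (RCF)", "Unsupervised Learning", ("Anomaly detection",)),
]


def candidates(paradigm: str, problem_type: str) -> List[str]:
    """Return the names of algorithms matching both the paradigm and the problem type."""
    return [
        algorithm.name
        for algorithm in CATALOGUE
        if algorithm.paradigm == paradigm and problem_type in algorithm.problem_types
    ]


# Labelled data and a numeric target still leave several candidates.
regression = candidates("Supervised Learning", "Regression")

# Unlabelled data and anomaly detection narrow the choice to one algorithm.
anomaly = candidates("Unsupervised Learning", "Anomaly detection")

print(regression)
print(anomaly)
```

When several candidates remain, as in the regression case, the other factors covered in the individual algorithm pages decide between them.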

## Use case examples

| SageMaker Algorithm | Use case |
| --- | --- |
| BlazingText algorithm | Assign predefined categories to documents in a corpus of text |
| DeepAR Forecasting Algorithm | Based on historical data for a behavior, predict future behavior |
| Factorization Machines Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |
| Image Classification Algorithm | Label/tag an image based on the content of the image |
| IP Insights | Protect your application from suspicious users |
| K-Means Algorithm | Group similar objects/data together |
| K-Nearest Neighbors (K-NN) Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |
| Latent Dirichlet Allocation (LDA) Algorithm | Organize a set of documents into topics (not known in advance) |
| Linear Learner Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |
| Neural Topic Model (NTM) Algorithm | Organize a set of documents into topics (not known in advance) |
| Object Detection Algorithm | Detect people and objects in an image |
| Object2Vec Algorithm | Improve the data embeddings of the high-dimensional objects |
| Principal Component Analysis (PCA) Algorithm | Drop those columns from a dataset that have a weak relation with the label/target variable. This reduces the number of features to be analyzed. |
| Random Cut Forest (RCF) Algorithm | Detect abnormal behavior in an application |
| Semantic Segmentation Algorithm | Tag every pixel of an image individually with a category |
| Sequence-to-Sequence Algorithm | Convert audio files to text; Summarize a long text corpus; Convert text from one language to another |
| XGBoost Algorithm | Predict a numeric/continuous value; Predict if an item belongs to a category |

# Classifying algorithms with Learning Paradigm and Data Type

The first two problem aspects we discussed have the fewest options: four for learning paradigms and three for data types. These aspects are also the easiest to identify in Problem Framing, since they are based on easily observable characteristics of the data. As the learning paradigm table shows, identifying whether the data requires Supervised or Unsupervised learning significantly reduces the number of suitable algorithms. However, you will still have five algorithms to choose from in each group.
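The group sizes quoted above can be checked with a small cross-tabulation. The sketch below is illustrative only: the `ALGORITHMS` dictionary is a hypothetical structure that restates the learning paradigm and data type assigned to each algorithm in this article, and it uses only the Python standard library.

```python
from collections import Counter

# (learning paradigm or domain, data type) for each built-in algorithm,
# transcribed from the tables in this article.
ALGORITHMS = {
    "DeepAR Forecasting": ("Supervised Learning", "Tabular"),
    "Factorization Machines": ("Supervised Learning", "Tabular"),
    "K-Nearest Neighbors (K-NN)": ("Supervised Learning", "Tabular"),
    "Linear Learner": ("Supervised Learning", "Tabular"),
    "XGBoost": ("Supervised Learning", "Tabular"),
    "IP Insights": ("Unsupervised Learning", "Tabular"),
    "K-Means": ("Unsupervised Learning", "Tabular"),
    "Object2Vec": ("Unsupervised Learning", "Tabular"),
    "Principal Component Analysis (PCA)": ("Unsupervised Learning", "Tabular"),
    "Random Cut Forest (RCF)": ("Unsupervised Learning", "Tabular"),
    "BlazingText": ("Textual Analysis", "Text"),
    "Latent Dirichlet Allocation (LDA)": ("Textual Analysis", "Text"),
    "Neural Topic Model (NTM)": ("Textual Analysis", "Text"),
    "Sequence-to-Sequence": ("Textual Analysis", "Text"),
    "Image Classification": ("Image Processing", "Image"),
    "Object Detection": ("Image Processing", "Image"),
    "Semantic Segmentation": ("Image Processing", "Image"),
}

# Count algorithms per learning paradigm, and per (paradigm, data type) pair.
paradigm_counts = Counter(paradigm for paradigm, _ in ALGORITHMS.values())
cross_tab = Counter(ALGORITHMS.values())

for (paradigm, data_type), count in sorted(cross_tab.items()):
    print(f"{paradigm:<22} {data_type:<8} {count}")
```

Every tabular algorithm falls under either Supervised or Unsupervised Learning, so for tabular data the labelled/unlabelled question is the one that splits the group.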

# Summary

SageMaker has seventeen built-in algorithms that can be used to build Machine Learning models. Four aspects can be used to select a model: data types and format, learning paradigm or domain, problem type, and use case examples. Using these aspects to select appropriate algorithms reduces the choice to a small group, and often to a single algorithm.

##### Credits

#### AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions.

#### 10 questions and answers
