Image of a child under three years old reading a fruit alphabet book to symbolize Unsupervised Learning

Unsupervised Learning for Machine Learning

What is Unsupervised Learning?

Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Its algorithms detect patterns and groupings without needing humans to label the data first, which makes it well suited to exploring raw and unfamiliar datasets.

Uses of Unsupervised learning

  • Exploratory data analysis to visualize patterns in data. Unsupervised learning may be used for data exploration prior to identifying opportunities for Supervised Learning.
  • Cross-selling marketing strategies
  • Customer segmentation
  • Image recognition

Questions

Scroll to the bottom of the page for questions and answers.

What are the advantages of unsupervised learning?

The big advantage of Unsupervised Learning is that it does not need labeled data. Labeling large datasets can be very expensive, especially if people have to do the labeling. Labeling is also difficult when the number or identity of the classes is not known, or is expected to change over time. This freedom from the burden of labeling means that Unsupervised Learning techniques can be used on new or changing data.

What types of Unsupervised Learning are there?

Unsupervised Learning techniques and algorithms can be divided into two groups:

  1. Probabilistic techniques
  2. Neural Networks

Unsupervised learning is part of sub-domain 3.1, Frame business problems as machine learning problems, which is in domain 3, Modeling. A description of all the knowledge domains in the exam is in these revision notes: AWS Machine Learning exam syllabus

Probabilistic Unsupervised Learning

What probabilistic techniques are there in Unsupervised Learning?

Probabilistic techniques are statistical methods that are used to examine unlabeled data to find groups and similarities. This also enables data that is dissimilar to be identified. Common techniques for unsupervised learning are:

  • Clustering
  • Topic modeling
  • Embeddings
  • Anomaly detection
  • Dimensionality Reduction

Video: Amazon SageMaker’s Built-in Algorithm Webinar Series: Clustering with K Means

This video by Kris Skrinak is quite long, but only the first 8 minutes 18 seconds are of specific relevance to this study guide. I recommend you watch that opening section now and come back to the rest later, when you have completed revising the SageMaker K-Means built-in algorithm.

Clustering

What is Clustering in Unsupervised Learning?

Clustering, or Cluster Analysis, is a group of techniques used to collect objects into groups based on their features or attributes. These groups are called Clusters. A Cluster is a group of objects whose features are similar to each other and dissimilar to the features of objects in other Clusters.

The features that determine if an object is a member of a Cluster may be simply present or absent, or they may form part of a spectrum. In the latter case the members of a Cluster will have features close to each other and distant from members of other Clusters. This can be described as distance or proximity.

The advantage of Clustering is that it provides an insight into the structure of the data. This can be used to formulate a Supervised Learning strategy.
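The idea of grouping by distance can be sketched with a small k-means implementation. This is only an illustration in plain Python, not the SageMaker K-Means algorithm; the customer data, the function name kmeans and its parameters are all assumptions made up for the example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters by repeatedly assigning and averaging."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments, centroids


# Hypothetical customers: (visits per month, average basket value)
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
labels, centres = kmeans(customers, k=2)
print(labels)   # one cluster index per customer
print(centres)  # the centre of each cluster
```

Customers with the same label are close to each other and far from the other group, which is the proximity described above. No labels were supplied; the groups come from the data alone.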

What types of Clustering are there?

There are different types of clustering including:

  • Exclusive clustering
  • Overlapping clustering
  • Hierarchical clustering
  • Probabilistic clustering

What SageMaker algorithms support Clustering?

Topic Modeling

What is Topic Modeling in Unsupervised Learning?

Topic modeling is used to analyze text-based data such as documents, emails and text messages. It uses clustering techniques to discover the topics that occur in a collection of documents. Algorithms are chosen to extract topics that are clear, segregated and meaningful. The output of Topic Modeling is a list of topics, which may be words or phrases, and groups of documents identified as containing topics from the list.

What is the advantage of Topic Modeling?

The advantage of Topic Modeling is that it does not need labeled data. This means that it can be rapidly deployed on incoming data and change the topics it identifies as the input data changes.

What are the applications of Topic Modeling?

The applications of Topic Modeling are:

  • Sentiment Analysis – Topic Modeling can be used to identify the user's sentiment from the message they sent. This can be used for marketing feedback to analyse how people felt about a product or service.
  • Opinion summarization – This is a type of Sentiment Analysis that relates people's opinions to specific documented policies or processes.
  • Help desk / call center automation – Topic Modeling can identify the topics of a request for assistance so that the call is routed to the most appropriate response.
  • Chatbots – Topic Modeling can be used to identify what the user wants the chatbot to do and match this with the documents and resources that the user needs.
  • Question and answer – Many websites have document search features to encourage user self-help. The topics can be extracted from the question and matched to the appropriate help pages, which can be ranked in order of the match to the identified topic.
  • Spam filters – Using Topic Modeling, text-based messaging systems such as email and text messages can be scanned to identify messages likely to be spam by the topics they contain. These can then be filtered out to a spam folder, or removed entirely.
  • Bioinformatics – This is an area of current research. Biological research can produce vast quantities of data. Topic Modeling can be used to identify biological topics for further investigation.

What SageMaker algorithms support Topic Modeling?

Embeddings

What are Embeddings in Unsupervised Learning?

Embeddings are a way of learning about an object once and then saving the learning in a form that allows it to be embedded in other learning models. So Embeddings are a way of reusing the results of prior training in other models.

The definition of embeddings provided by the Google Machine Learning Crash Course is:

“An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.”

Machine Learning Crash Course, Google
  • Google Machine Learning crash course video: Embeddings

An example of Embeddings is understanding words and phrases. However, the objects for Embeddings are not restricted to text.
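To make "semantically similar inputs close together" concrete, here is a minimal sketch that compares word vectors by cosine similarity. The three-dimensional vectors are hand-written assumptions for illustration only; real embeddings are learned from data and usually have far more dimensions. The helper names cosine_similarity and nearest are also made up for this example.

```python
import math

# Hypothetical embeddings, invented for this example (not learned from data).
embeddings = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "laptop": [0.1, 0.9, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose embedding is most similar to `word`."""
    others = (w for w in embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


print(nearest("apple"))
```

Because the vectors are stored once in a lookup table, any downstream model can reuse them without repeating the original training, which is the reuse described above.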

What are the advantages of Embeddings?

By reusing inferences created by prior learning efforts, the Model will work more efficiently than if it had to learn these inferences by itself. The Embeddings may also be of high quality if they were created using superior training data.

What SageMaker algorithms support Embeddings?

Anomaly Detection

What is Anomaly Detection in Unsupervised Learning?

Anomaly detection is used to identify patterns that do not conform to expected or usual behavior. This is defined more specifically in the Google article Use real-time anomaly detection reference patterns to combat fraud as:

“Anomaly detection—or in broader terms, outlier detection—allows businesses to identify and take action on changing user needs, detect and mitigate malignant actors and behaviors, and take preventive actions to reduce costly repairs.”

Use real-time anomaly detection reference patterns to combat fraud, Shan Kulandaivel & Cody Irwin, Google

What is Anomaly Detection used for?

  • Network Traffic monitoring
  • Anti-fraud
  • Security analytics
  • Fault detection in operating environments

What types of Anomaly are there?

Anomalies can be described in three general groups:

  • Point Anomalies
  • Contextual Anomalies
  • Collective Anomalies

A Point Anomaly is a single instance of data that is anomalous because it is too unlike the rest. An example is detecting financial fraud based on the amount spent in one transaction.

With Contextual Anomalies the event could be normal in some circumstances, but not in others. For example, high spend on alcohol may be usual at weekends but not from Monday to Thursday. Another example is multiple purchases of high denomination gift vouchers outside the Christmas season.

With Collective Anomalies we are looking for a series of events that together are unusual. For example, changes to security permissions, moving large files and downloading to a location outside the organisation may together indicate a cyber attack.
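The difference between a point anomaly and a contextual anomaly can be sketched in a few lines of plain Python. The scoring rule used here (a modified z-score built on the median absolute deviation, with the conventional 3.5 cut-off) is one common choice and an assumption of this example, not the method used by any SageMaker algorithm. All spending figures and function names are invented.

```python
from statistics import median


def point_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


def contextual_anomalies(records, threshold=3.5):
    """records are (context, value) pairs; judge each value within its context."""
    by_context = {}
    for context, value in records:
        by_context.setdefault(context, []).append(value)
    return [
        (context, value)
        for context, value in records
        if value in point_anomalies(by_context[context], threshold)
    ]


# Point anomaly: one transaction amount is far from all the others.
amounts = [12, 15, 11, 14, 13, 12, 16, 950]
print(point_anomalies(amounts))

# Contextual anomaly: a spend of 70 is ordinary at the weekend, unusual midweek.
spend = [("weekend", 60), ("weekend", 70), ("weekend", 65), ("weekend", 75),
         ("weekend", 68), ("weekday", 5), ("weekday", 8), ("weekday", 6),
         ("weekday", 7), ("weekday", 70), ("weekday", 6)]
print(contextual_anomalies(spend))
```

Scored against all the spend values together, 70 is unremarkable; it only stands out once it is compared with other weekday values. Collective anomalies need a model of sequences of events and are not covered by this sketch.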

What is the advantage of Anomaly Detection?

The advantage of Anomaly Detection is that it enables the detection of events that have not yet been identified as being detrimental or advantageous. This means that you can identify events of interest that have never happened before, allowing an organisation to take early action. Rapid detection allows for rapid mitigation to reduce the cost of the Anomaly, or to take advantage of it. Because the detection is automatic, it eliminates the manual burden of identifying unusual activity.

What SageMaker algorithms support Anomaly Detection?

Dimensionality Reduction

What is Dimensionality Reduction in Unsupervised Learning?

Dimensionality Reduction is concerned with reducing the number of features, for example by reducing a feature set from fifty columns to twenty. The techniques are chosen so that the new, smaller feature set retains some meaningful properties of the original data.
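As a rough illustration, the sketch below reduces three columns to one using the first principal component, found by power iteration. Principal Component Analysis is just one dimensionality reduction technique, chosen here as an assumption for the example; the data and function names are invented, and a real project would normally use a library rather than hand-written code.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector along the direction of most variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix of the centred columns.
    cov = [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def reduce_to_one_feature(rows):
    """Project every row onto the first principal component."""
    means, component = first_principal_component(rows)
    return [
        sum((value - mean) * weight
            for value, mean, weight in zip(row, means, component))
        for row in rows
    ]


# Hypothetical data: the second column is roughly double the first,
# and the third column barely varies.
rows = [(1, 2.1, 0.5), (2, 3.9, 0.4), (3, 6.2, 0.6), (4, 7.8, 0.5), (5, 10.1, 0.5)]
reduced = reduce_to_one_feature(rows)
print(reduced)  # one number per row instead of three
```

The single new column keeps the ordering and spread of the rows, which is the "meaningful property" that survives the reduction in this example.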

Dimensionality is discussed in detail in these revision notes: See Feature Engineering (Sub-domain 2.2)

What SageMaker algorithms support Dimensionality Reduction?

Unsupervised Learning using Neural Networks

What are Neural Networks?

Neural Networks are algorithms that mimic the human brain’s capacity to learn. Neural Networks are made up of layers of neurons, or nodes, that are interlinked in a network. The advantage of Neural Networks is that once they are trained they can make inferences very fast and with great accuracy.

How do Neural Networks work?

Neural Networks are made up of layers of nodes organised into an input layer, an output layer and one or more hidden layers in between. In a fully connected network each node is linked to all of the nodes in the next layer. A node contains the maths that drives the Neural Network and determines what happens when an input is received. The outcome will be to pass information on to the next layer, or to make no response. This is the part of the Neural Network that can be trained.
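The flow of information through the layers can be sketched as a forward pass through a tiny fully connected network. The weights and biases below are hand-picked assumptions, not trained values, and the sigmoid activation is only one of several possible choices; training would consist of adjusting these numbers.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0..1 (the node's 'response')."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: every input feeds every node in the layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# Hypothetical parameters, invented for this example.
hidden_weights = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]  # 3 hidden nodes, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.4, 0.2]]                      # 1 output node, 3 inputs
output_biases = [0.05]


def predict(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_forward(inputs, hidden_weights, hidden_biases)
    return layer_forward(hidden, output_weights, output_biases)


print(predict([1.0, 2.0]))
```

Each node computes a weighted sum of its inputs plus a bias and passes the result through the activation function. A value near 0 corresponds to "no response" and a value near 1 to a strong signal passed on to the next layer.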

Video: Neural Networks and Deep Learning: Crash Course AI #3

A gentle introduction by Jabril from Crash Course AI. This video does not contain much maths. Don’t be put off by the cringy props that wouldn’t amuse an eight-year-old. This video is worth watching and is only 12 minutes 22 seconds long.

What types of Neural Networks are there?

There are many different types of Neural Networks which differ in three ways:

  • The shape, or topology of the network
  • The types of node or cells in the network
  • The linking between the nodes

This article has a good image summarising many of them: 

How does SageMaker support Neural Networks?

SageMaker supports Neural Networks with its Deep Learning features and services.

Deep Learning

What is Deep Learning in Unsupervised Learning?

Deep Learning is another name for large Neural Networks. The depth refers to the number of layers in the network. However, there is no agreed definition of what Deep Learning is.

What are the advantages of Deep Learning?

The advantages of Deep Learning are:

  • Speed – Neural Networks are designed for speed.
  • Scalability – Because Neural Networks are fast, they can handle large quantities of unlabeled data.
  • Flexibility – There is a range of Deep Learning frameworks and interfaces available for a range of use cases.

Example applications of Deep Learning

  • Computer vision
  • Speech recognition
  • Natural Language Processing
  • Recommendation engines

Deep Learning frameworks and interfaces are discussed further in: Machine Learning services and features.

Reinforcement Learning

What is Reinforcement Learning in Unsupervised Learning?

AWS contrasts Reinforcement Learning (RL) with unsupervised learning as follows:

“In unsupervised learning, an agent learns by uncovering structure within unlabeled data. While a RL agent might benefit from uncovering structure based on its experiences, the sole purpose of RL is to maximize a reward signal.”

Use Reinforcement Learning with Amazon SageMaker, Amazon Web Services

Reinforcement learning draws its heritage from research into the psychology of learning processes of real animals and people. In the 1980s it was combined with computational research to produce the Reinforcement Learning used in Machine Learning today.

What are the advantages of Reinforcement Learning?

The advantages of Reinforcement Learning are:

  • Maximizes performance
  • Sustains change over a long period of time
  • Can be used on new or changing data

How does Reinforcement Learning work?

In Reinforcement Learning an Agent (a program) seeks out answers to problems and receives a reward when the problem is solved. The next time the Agent meets the same or a similar problem it tries the same solution again and so reaches the reward more rapidly. In this way the Agent learns how to get the reward: it favours actions that lead to the reward and stops performing actions that do not. So the learning is reinforced by the reward.

Most Reinforcement Learning models can be described as Markov Decision Processes (MDP). A Markov Decision Process comprises one or more Episodes. Each Episode is made up of one or more Time Steps. Each Time Step has the following features:

  • Environment – The world the Agent operates in and interacts with.
  • State – This is the current situation and environment which is relevant to what Actions the Agent can take next.
  • Action – What the Agent does.
  • Reward – This is the Reward that may be received for a successful Action.
  • Observation – This is what the Agent can perceive of the Environment.
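The Episode and Time Step structure above can be sketched with tabular Q-learning on a toy corridor. This is a plain Python illustration and does not use the SageMaker RL toolkits; the environment, the reward of 1.0 at the goal, the hyperparameters and all names are assumptions made up for the example. Q-learning is only one of many RL algorithms.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Environment: apply an Action, return (next State, Reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn q[state][action index], the expected reward of each Action."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):                 # each pass is one Episode
        state, done = 0, False
        while not done:                       # each loop is one Time Step
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)          # explore, or break a tie
            else:
                a = 0 if q[state][0] > q[state][1] else 1  # exploit
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate towards reward + discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = train()
# Greedy policy for every non-goal cell.
policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(policy)
```

In this toy world the State is fully visible, so the Observation and the State are the same thing. After training, the learned policy should prefer stepping right in every cell, because only that direction leads to the Reward.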

How does SageMaker support Reinforcement Learning?

In SageMaker, Reinforcement Learning (RL) is implemented using SageMaker supported Deep Learning frameworks, interfaces and toolkits. The supported Deep Learning frameworks are TensorFlow and Apache MXNet. The SageMaker RL Toolkit manages interactions between the Agent and the environment and also provides RL algorithms. SageMaker also supports the Intel Coach and Ray RLlib toolkits. There is a wide range of environments available and SageMaker supports the OpenAI Gym interface.

Video: What is Reinforcement Learning? | AI 101

This video by Jordan Harrod is 10 minutes 47 seconds long.

Summary

Unsupervised Learning opens up the possibility of finding out things about your data that you never knew were there. This is due to the potential of unlabeled data where the endgame may not be fully known. The two types of Unsupervised Learning are Probabilistic and Neural Networks. The Probabilistic methods are based on statistics and are supported in SageMaker by SageMaker built-in algorithms. Neural Networks are built using SageMaker Deep Learning features and services, which include the frameworks TensorFlow and Apache MXNet. SageMaker also supports Reinforcement Learning in this way.

Credits

Photo by Phong Duong on Unsplash


AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions for immediate access. (Visit Amazon books)


Questions and answers

Created by Michael Stainsbury

3.1 Unsupervised Learning for Machine Learning (full)

These test questions are part of sub-domain 3.1, Frame business problems as machine learning problems, which is in the Modeling knowledge domain.

1 / 5

What are the two groups that unsupervised learning techniques and algorithms can be divided into?

3 / 5

The aim of a Reinforcement Learning agent is to maximize a <–?–> signal.

4 / 5

Neural Networks are made up of layers of <–?–> that are interlinked in a network.

5 / 5

Unsupervised Learning is the machine learning task of inferring a function to describe hidden structure from <–?–> data.

