Glasses and book to symbolize Data Visualization

Data visualization for Machine Learning

This Study Guide covers the graphs used for data visualization in subdomain 2.3, Analyze and visualize data for machine learning, of the AWS Certified Machine Learning – Specialty exam. Although many types of graphs and charts can be used to display data, a few are of particular importance for Machine Learning. They are used as part of exploratory data analysis to understand the data so that it can be prepared for Machine Learning. There are eight graphs listed in the AWS Exam Readiness course.

Questions

To confirm your understanding, scroll to the bottom of the page for 10 questions and answers.

Most of the graphs are known by more than one name. The names used here are the ones AWS uses, so they are the ones most likely to appear in the exam.

These questions are answered for each graph:

  • What the graph shows.
  • How the graph is made. This includes an example using free tools on the internet and public domain datasets.
  • When to use the graph.
  • The strengths and weaknesses of the graph.
  • Further reading. These are useful internet resources that provide more information about the graph.

After each type of graph has been discussed, their use is compared in a table that identifies their suitability to answer common questions when preparing data for machine learning.

  1. Comparing values
  2. Showing the composition of something
  3. Understanding the distribution of the data
  4. Showing trends
  5. Showing the relationships between variables
Further reading

Line Graphs

Strawberries - Strawberry data used to show line graphs

What Line Graphs show

A typical line graph has continuous data along both the vertical (y-axis) and horizontal (x-axis) dimensions. The y-axis usually shows the value of whatever variable we are measuring; the x-axis is most often used to show when we measured it, for example chronologically. You can show four or five features on a Line Graph, which allows you to compare their values. More than that may be difficult to see and understand.

Line graphs are also known as time series charts, line charts, line plots or curve charts.

How to make Line Graphs

We all know how to make line graphs; it is something we have been doing since sixth grade. So here I will discuss how to use a ubiquitous visualization tool, Google Sheets, to have a quick look at our data. The data will have to be small (no more than 40,000 rows), clean and well behaved. Here is a page listing other limitations:

This is a Line Graph showing UK Strawberry imports in 1000 metric tons by year. The data came from UK government National Statistics – Latest horticulture statistics.

Line graph using uk Strawberry imports

It should be really quick to make a graph like this, but there are a couple of gotchas that will leave you searching Google for answers. Google Sheets expects the data to be arranged vertically, for example a column of years with the tons of strawberries in the next column. If the data is not arranged this way, you must select the switch rows/columns check box. The Strawberry weights on the vertical scale are entered in the series field and the year labels on the horizontal scale are entered into the x-axis field. The range field is auto-populated from these two values. Select use row as labels. Here is a screenshot of the settings:

Line graph settings UK strawberry import

When to use Line Graphs

Use Line Graphs when you want to show value changes over time. You can display several features as they change over time on the same graph. Other continuous variables can be used instead of time.

What are the strengths and weaknesses of Line Graphs

Strengths:

  • Simple to read and understand
  • Showing changes and trends over time
  • Compare trends in different groups of a variable
  • Highlighting anomalies within and across data series, which allows error values to be identified

Weaknesses:

  • Can be difficult to plot and compare multiple features on the same graph.
  • Displaying quantities of things
  • Working with categorical data
  • Making part-to-whole comparisons
  • Showing sparse data sets
Further reading

Scatter Plots

Row of five apples - Apple data used to show scatter plots

What Scatter Plots show

A Scatter Plot shows the relationship between two numerical variables plotted simultaneously along the horizontal and vertical axes. Scatter Plots are often used to test the correlation between two variables. They show the relationship between two things, but it is not uncommon to display more than two dimensions, especially when exploring your data.

Scatter Plots are also known as scatter diagrams and X-Y graphs.

How to make Scatter Plots

The data used for this Scatter Plot comes from the UK Government Department for Environment, Food & Rural Affairs wholesale fruit and vegetable prices. This dataset can be downloaded at no charge. It comprises weekly prices for popular fruit and vegetables going back five years. I took the previous three months and manually extracted the records for Apples.

Because the data was in a vertical list it was easier to get the Scatter Plot to recognise it. The x-axis field was taken from the column of dates and the series field was the price of fruit per kg. The Scatter Plot is a great way to see the shape and distribution of the data. This data appears to have a lot of outliers, whilst most of the data points are clumped at the bottom. This is where understanding your data becomes important. Whilst this data is for apples, it covers different varieties of apples, some with quite different prices, and these cause the distribution you can see.

A Scatter plot showing weekly prices for UK Apples

When to use Scatter Plots

Scatter Plots are used to show the relationship, or correlation, between two variables where one is dependent and one independent. A Scatter Plot will also show the presence of outliers in the data. Different colours can be used to display more than two variables.

What are the strengths and weaknesses of Scatter Plots

Strengths:

  • Good at showing data correlation and relationships
  • Good at illustrating non-linear patterns
  • Shows the spread of data including outliers

Weaknesses:

  • Difficult to show individual data point values
  • Difficult to show relationship between more than two variables at once
Further reading

Box Plots

A basket of Broad beans, French beans and Runner beans - green beans data used to show box plots

What Box Plots show

A Box Plot is used to show the shape of the distribution, the central value, and the distribution’s variability. This is done by displaying a summary of a large amount of data in five attributes:

  1. minimum
  2. first quartile (median of the lower half of the data)
  3. median
  4. third quartile (median of the upper half of the data)
  5. maximum

Box Plots also show outliers. Box Plots are also known as Box and Whisker plots.
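The five summary values can also be computed outside a spreadsheet. Below is a minimal Python sketch using only the standard library. The prices are made-up illustration values, not taken from the vegetable dataset. The "inclusive" quartile method and the 1.5 × interquartile range rule for flagging outliers are assumptions on my part; they are common conventions, but a particular plotting tool may use different ones.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(values)
    # Assumption: "inclusive" interpolation, which treats the smallest and
    # largest values as the 0% and 100% points of the data.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median(data), q3, data[-1]


def find_outliers(values):
    """Flag values more than 1.5 x IQR beyond the quartiles (a common convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]


# Hypothetical prices per kg, for illustration only.
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
summary = five_number_summary(prices)
outliers = find_outliers(prices)
print("min, Q1, median, Q3, max:", summary)
print("outliers:", outliers)
```

However many records go in, the output is always the same five numbers plus a list of any values that sit far away from the rest.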

How to make Box Plots

To get data to show Box Plots I used the vegetable prices data again, this time for green beans. There is data for three varieties of beans: Broad beans, French beans and Runner beans. I sorted the records in Google Sheets to group the bean data together and then used the Function feature to summarise the data into five metrics. This table shows the Function format for each of the five metrics:

| Description | Function example | Location of Function |
| --- | --- | --- |
| Minimum | =MIN(D1:D29) | Insert => function => MIN |
| First quartile (median of the lower half) | =QUARTILE(D1:D29,1) | Insert => function => Statistical => QUARTILE |
| Median | =MEDIAN(D1:D29) | Insert => function => Statistical => MEDIAN |
| Third quartile (median of the upper half) | =QUARTILE(D1:D29,3) | Insert => function => Statistical => QUARTILE |
| Maximum | =MAX(D1:D29) | Insert => function => MAX |

So all the data is summarised into just these five values. The data we are using is small compared to data that is usually used for Machine Learning, but even big data will still be summarised down to just these five values. This is what they look like when plotted:

box plot with green beans UK prices data

I used this online box plot plotter:

The three groups represent, in order, Broad beans, French beans and Runner beans.

When to use Box Plots

Box Plots are used to summarise the distribution of numeric variables. They allow each attribute’s distribution to be reviewed and outliers to be observed.

What are the strengths and weaknesses of Box Plots

Strengths:

  • Can present a clear summary of a large amount of data.
  • Show outliers.

Weaknesses:

  • Exact data point values are not displayed.
Further reading

Histograms

A bowl of blueberries - Blueberry data used to show Histograms

What Histograms show

Histograms show the distribution of data for one or two variables. Whilst up to four variables can be shown, the graph can become unreadable if more than one or two are shown. The visual shape of the data can be used to describe the data’s distribution, for example:

  • Gaussian distribution
  • Exponential distribution
  • Skewed distribution

The Histogram can also show rounding errors and outliers.

How to make Histograms

I used the UK horticultural data again and using Google Sheets extracted data for Blueberries. Sheets has a Histogram chart which it calls a Column Chart. When I first displayed the data it was simply full of vertical bars showing some variation between them. The data did not contain records for weeks where no blueberries were grown. So I added in extra zero value records for the missing weeks. The graph you see below now shows that in the UK Blueberries are harvested from late June to November with no locally grown Blueberries available from December to early June. So we now know when it is a good time to export Blueberries to the UK.

Histogram showing weekly UK blueberry prices

When to use Histograms

Histograms are used to understand the distribution of numeric data. The data is divided up into bins and we get a total count for each bin. This allows outliers to be identified.
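The binning step described above can be sketched in a few lines of Python. This is only an illustration: the values and the bin width of 1.0 are assumptions chosen to keep the example small, and the text bars stand in for a real chart.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each bin; bin i covers [i*w, (i+1)*w)."""
    counts = Counter(int(v // bin_width) for v in values)
    first, last = min(counts), max(counts)
    # Include empty bins so that gaps in the data stay visible.
    return {i: counts.get(i, 0) for i in range(first, last + 1)}


# Hypothetical measurements, for illustration only.
values = [0.5, 1.2, 1.7, 2.1, 2.4, 2.9, 3.3, 9.8]
bin_width = 1.0
counts = bin_counts(values, bin_width)

for index, count in counts.items():
    low = index * bin_width
    print(f"{low:4.1f} to {low + bin_width:4.1f} | {'#' * count}")
```

Most of the bars cluster in the first few bins, and the single value in the last bin, separated by a run of empty bins, is the kind of outlier a histogram makes easy to spot.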

What are the strengths and weaknesses of Histograms

Strengths:

  • A histogram allows you to see the frequency distribution of a data set. It offers an “at a glance” picture of a distribution pattern.
  • Can show categorical data.

Weaknesses:

  • Because the data is grouped into bins some information can be obscured.
  • Shows more than two sets of data poorly.
Further Reading

Scatter Matrix

Coffee and food on a tray - Coffee and food data used to show scatter matrix

What Scatter Matrices show

Scatter Matrices show how much variables are affected by each other and the relationship, if any, between them. They are also known as scatterplot matrices, correlograms and pairplots.

How to make Scatter Matrices

Below is a scatter matrix made from nutritional data for food from Starbucks, downloaded from Kaggle. The data was sorted to remove records without data and the first three parameters were used to make the scatter plots below. Each plot is made from a different combination of the three parameters.

A Scatter Matrix showing nutritional information for coffee shop food

Unfortunately there is not much free support for producing a Scatter Matrix. There is a paid-for plugin for Microsoft Excel, but this one was produced manually with Sheets by assembling Scatter Plots in a matrix form.
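When assembling a Scatter Matrix by hand it helps to list which pairs of parameters need a plot. The following Python sketch enumerates them; the three parameter names are placeholders I have assumed, not the actual column names in the Kaggle dataset.

```python
from itertools import combinations


def unique_pair_count(variable_count):
    """Number of distinct variable pairs, ignoring order and self-pairs."""
    return variable_count * (variable_count - 1) // 2


# Hypothetical parameter names, for illustration only.
variables = ["calories", "fat", "carbohydrates"]
pairs = list(combinations(variables, 2))

for x_name, y_name in pairs:
    print(f"scatter plot: {x_name} against {y_name}")

print("pairs for 3 variables:", unique_pair_count(3))
print("pairs for 10 variables:", unique_pair_count(10))
```

The number of pairs grows quickly with the number of variables, which is why detail becomes hard to see once many variables are compared in one matrix.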

When to use a Scatter Matrix

Use a Scatter Matrix to understand the relationships between three or more features in a model and the correlation between the variables. A Scatter Matrix can be used as part of a dimensionality reduction process.

What are the strengths and weaknesses of Scatter Matrices

Strengths:

  • Allows a reasonably large number of variables to be compared in the same graph.

Weaknesses:

  • Can be difficult to see detail when ten or more variables are compared.
Further reading

Correlation matrix

Fruit in boxes - fruit data used to show correlation graph

When to use a Correlation Matrix

A Correlation Matrix is used to analyze the relationships between pairs of numeric variables. This allows all variables in a model to be analyzed for correlation with each other. The Correlation Matrix is often used for exploratory analysis of a dataset.

What a Correlation Matrix shows

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation coefficient between two variables. A correlation coefficient is a number between minus one and one that shows how strongly two variables are linearly related. A value of one indicates complete positive correlation, so that as one variable changes the other changes the same way. A value of minus one indicates complete negative correlation, so that as one variable increases the other decreases. A value of zero shows there is no linear correlation between the two variables. The diagonal of the table is always one because a variable is always perfectly correlated with itself.
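To make the idea concrete, here is a small Python sketch that builds a correlation matrix from the Pearson correlation coefficient, which ranges from -1 to +1. The column names and numbers are invented for illustration and are not taken from the fruit nutrition data; Pearson is assumed here because it is the usual default, although other coefficients exist.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical columns, for illustration only.
columns = {
    "sugar": [1.0, 2.0, 3.0, 4.0],
    "energy": [2.0, 4.0, 6.0, 8.0],  # rises with sugar
    "water": [4.0, 3.0, 2.0, 1.0],   # falls as sugar rises
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

In this toy data the second column rises in step with the first and the third falls in step with it, so the off-diagonal cells sit at the two extremes of the scale, while every diagonal cell compares a column with itself.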

How to make a Correlation Matrix

The data I used for the Correlation Matrix came from a UK government survey of the nutrient composition of fruit and vegetables, which is free to download. From this data I extracted nutrient information for some common fruit: Bananas, Apples, Green grapes, Red grapes, Strawberries, Blueberries, Oranges.

It is possible to create a Correlation Matrix in Google Sheets. The function CORREL is in the Statistical section of the Function menu. The inputs are two ranges defined by selecting the ranges from your table data. However, for more than a few attributes this is a very manual and error-prone method. Another way is to use the online tool from DisplayR. This has a free version which is fine for learning how to make different graph types.

A Correlation chart for fruit nutrition data

In this version of the Correlation Matrix the Correlation Coefficients are displayed as colours. The key is to the right. The top diagonal shows a Correlation Coefficient of 1.0 as dark blue.

What are the strengths and weaknesses of a Correlation Matrix

Strengths:

  • Good at showing the correlation between pairs of variables.

Weaknesses:

  • Does not show plots
  • Cannot see outliers or curved (non-linear) patterns
Further reading
Video – 8.4 – Correlation Matrix from 1 min Statistics
This video about the Correlation Matrix is 1 minute long.

Heatmaps

cut fruit in a bowl - fruit price data used to show heatmaps

When to use Heatmaps

Heatmaps are used to display a general overview of numeric data.

What Heatmaps show

A heatmap is a visualization that uses color the way a bar graph uses height and width to represent the data. Individual values are shown by colors. Typically the colors are chosen to range from a light, or warm, color to a dark, cold color. Heatmaps are good at showing a general overview of the data and can allow clustering to be visualized.

How to make Heatmaps

The graph below shows the prices of Blackberries, Gooseberries, Raspberries and Strawberries in the summer months of July, August and September. The darker the color, the higher the price. You can see that Blackberries command a premium price at the end of the summer whilst Gooseberries are only available from August onwards.

A heatmap showing soft fruit summer prices

The data was extracted from UK Government Department for Environment, Food & Rural Affairs wholesale fruit and vegetable prices. This dataset can be downloaded at no charge.
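The mapping from value to color can be sketched in Python. This example rescales a grid of numbers to the range 0 to 1 (min-max normalization) and then picks a text "shade" for each cell. The prices are invented, and the five-character shade scale is a stand-in I have assumed for a real color scale.

```python
SHADES = " .:*#"  # lightest to darkest; a stand-in for a color scale


def min_max_normalize(grid):
    """Rescale every value in a 2-D list so the smallest is 0.0 and the largest 1.0."""
    flat = [value for row in grid for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # All values identical: there is no range to scale, so use 0.0 throughout.
        return [[0.0 for _ in row] for row in grid]
    return [[(value - low) / (high - low) for value in row] for row in grid]


def shade(normalized_value):
    """Pick the shade character for a value between 0.0 and 1.0."""
    return SHADES[round(normalized_value * (len(SHADES) - 1))]


# Hypothetical prices: one row per fruit, one column per month.
prices = [
    [2.0, 4.0, 6.0],
    [3.0, 3.0, 10.0],
]
normalized = min_max_normalize(prices)
for row in normalized:
    print("".join(shade(value) for value in row))
```

Because each cell is reduced to one of a handful of shades, the overview is easy to read but the exact number behind a cell cannot be recovered from its color alone.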

What are the strengths and weaknesses of Heatmaps

Strengths:

  • Good general summary of a numerical value.
  • Colour can be used to show when a boundary value has been passed.

Weaknesses:

  • Data often needs to be normalized.
  • It can be hard to translate a color into a precise number.
  • Not good at displaying specific data points.
Further reading

Confusion matrix

Oranges in a tree - citrus plant disease data used to show confusion matrix

When to use a Confusion Matrix

A Confusion Matrix is used to check the performance of a classification model. It is also known as an error matrix.

What a Confusion Matrix shows

The Confusion Matrix shows the ways in which your classification model is confused when it makes predictions. The degree of confusion can be described with these definitions. This allows different Machine Learning models to be compared.

  • Accuracy: Overall, how often is the classifier correct?
  • Error Rate: Overall, how often is the classifier wrong?
  • True Positive Rate: When it’s actually yes, how often does it predict yes? This is also known as “Sensitivity” or “Recall”
  • False Positive Rate: When it’s actually no, how often does it predict yes?
  • True Negative Rate: When it’s actually no, how often does it predict no? This is also known as “Specificity”
  • Precision: When it predicts yes, how often is it correct?

How to make a Confusion Matrix

I used results from an IJEAT paper on identifying plant diseases of Citrus fruits using Machine learning of image data. This is the Confusion Matrix for Canker:

| | Predicted NO | Predicted YES |
| --- | --- | --- |
| Actual NO | True Negative = 70 | False Positive = 2 |
| Actual YES | False Negative = 1 | True Positive = 77 |

https://www.ijeat.org/wp-content/uploads/papers/v9i2/B4066129219.pdf

The Confusion Matrix compares what the ML algorithm detected against the actual condition of the fruit. We can now put numeric values to the definitions listed above.

The total number of fruits examined is the total of the numbers in the matrix:

Total = 70 + 2 + 1 + 77 = 150

Accuracy: (True Positive + True Negative)/Total

( 70 + 77 ) / 150 = 0.98, or 98%

Error Rate: ( False Positive + False Negative ) / Total

( 2 + 1 ) / 150 = 0.02, or 2%

Recall Rate: True Positive / Actual Yes, also: True Positive / ( False Negative + True Positive )

77 / ( 1 + 77 ) = 0.9872, or 98.72%

False Positive Rate: False Positive / Actual No, also: False Positive / ( False Positive + True Negative )

2 / ( 2 + 70 ) = 0.0278, or 2.78%

True Negative Rate: True Negative / Actual No, also: True Negative / ( True Negative + False Positive )

70 / ( 70 + 2 ) = 0.9722, or 97.22%

Precision: True Positive / Predicted Yes, also: True Positive / ( True Positive + False Positive )

77 / ( 77 + 2 ) = 0.9747, or 97.47%
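The same calculations can be wrapped in a short Python function so they can be repeated for other models or classes. This is a sketch for a two-class matrix; the function and key names are my own, and the four counts are the ones from the Canker matrix above.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute the standard rates from the four cells of a 2 x 2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),               # true positive rate, sensitivity
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),   # specificity
        "precision": tp / (tp + fp),
    }


# Counts from the Canker confusion matrix.
metrics = confusion_metrics(tn=70, fp=2, fn=1, tp=77)
for name, value in metrics.items():
    print(f"{name}: {value:.4f} ({value:.2%})")
```

Accuracy and error rate always sum to one, as do the false positive rate and the true negative rate, which is a quick way to check the arithmetic.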

What are the strengths and weaknesses of a Confusion Matrix

Strengths:

  • Enables the evaluation of classification models.

Weaknesses:

  • Summary measures such as classification accuracy have limitations and can hide important details.
  • Need to know true values for comparison.
Further reading
Video – Making sense of the confusion matrix

This video by Kevin Markham is based on his post: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/. The video is 35 minutes long with the first 16 minutes being the most relevant to the AWS Machine Learning exam. Here are the timestamps in case you want to skip forwards:

  • 0 – Simple guide to the Confusion Matrix
  • 2.22 – Layout of Confusion Matrix
  • 5.25 – True Positive
  • 5.45 – True Negative
  • 6.00 – False Positive
  • 7.00 – False Negative
  • 7.54 – Three points to remember about Confusion Matrix
    1. The values are whole number counts
    2. The terms only make sense when you have a positive class
    3. If you have more than two classes the terminology should not be used because it is confusing!
  • 10.30 – A confusion matrix with row and column totals and terminology
  • 11.25 – Accuracy
  • 12.00 – Misclassification / Error rate
  • 13.20 – True Positive rate, also called Sensitivity or Recall
  • 14.25 – False Positive rate
  • 15.00 – True Negative rate, also called Specificity
  • 16.06 – Precision
  • 17.18 – ROC curve
  • 19.00 – Answers to questions posted on the web page
This video about the Confusion Matrix is 35.24 minutes long, however the first 16 minutes is the most relevant part.

When to use which graph?

The selection of graph depends on what you want to find out or show. Here are some typical areas of analysis:

  1. Comparing values
  2. Showing the composition of something
  3. Understanding the distribution of the data
  4. Showing trends
  5. Showing the relationships between variables
 12345
Line graphsX1XXX
Scatter plots X X
Box plots  X  
HistogramsX 1XX 
Scatter matrixX X  
Correlation matrix    X
HeatmapsX XX 
Confusion matrixX   X

1 – Stacked line graphs and stacked bar charts can show composition. These types of graphs are not discussed in these revision notes.

Further reading

When to use the different types of graphs:

Summary

These revision notes have discussed eight visualizations used to analyze data to prepare it for Machine Learning. The key features for each visualization are:

  • When to use
  • What it shows
  • How to make
  • Strengths and weaknesses
  • Further reading

At the end there is a table linking the visualizations to typical questions asked when preparing data for Machine Learning.

Data sources

Credits

Notes

  • The graphs discussed in these revision notes are stated in the Exam Readiness: AWS Certified Machine Learning – Specialty course.
  • Density graphs are mentioned on the AWS ML pipeline course.



10 questions and answers

Created on By Michael Stainsbury

2.3 Data visualization for Machine Learning (full)

This test has 10 questions for sub-domain 2.3 Analyze and visualize data for machine learning from the Exploratory data analysis knowledge domain.

3 / 10

A <–?–> shows data composition at a single point in time.

4 / 10

<–?–> can represent values as colors?

8 / 10

What charts can be used to visualize comparisons?

9 / 10

The number of variables displayed in a bar chart is <–number–>

10 / 10

What charts show data distribution?


