# Data visualization for Machine Learning

This Study Guide covers the graphs used for data visualization, which is part of subdomain 2.3, *Analyze and visualize data for machine learning*, of the AWS Certified Machine Learning – Specialty exam. Whilst many types of graphs and charts can be used to display data, a few are of particular importance for Machine Learning. They are used during exploratory data analysis to understand the data so that it can be prepared for Machine Learning. Eight graphs are listed in the AWS Exam Readiness course.

# Questions

To confirm your understanding **scroll to the bottom of the page for 10 questions and answers.**

Most of the graphs are known by more than one name. However, these are the names used by AWS, so they are the ones most likely to be used in the exam.

These questions are answered for each graph:

- What the graph shows.
- How the graph is made. This includes an example using free tools on the internet and public domain datasets.
- When to use the graph.
- The strengths and weaknesses of the graph.
- Further reading. These are useful internet resources that provide more information about the graph.

After each type of graph has been discussed, the graphs are compared in a table that shows how suitable each one is for common tasks when preparing data for machine learning:

- Comparing values
- Showing the composition of something
- Understanding the distribution of the data
- Showing trends
- Showing the relationships between variables

###### Further reading

- http://www.storytellingwithdata.com/chart-guide
- https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
- good diagrams: https://www.data-to-viz.com/

# Line Graphs

## What Line Graphs show

A typical line graph has continuous data along both the vertical (y-axis) and horizontal (x-axis) dimensions. The y-axis usually shows the value of the variable being measured; the x-axis most often shows when it was measured, in chronological order. You can show four or five features on one Line Graph, which allows you to compare their values. More than that may be difficult to read and understand.

Line graphs are also known as time series charts, line charts, line plots or curve charts.

## How to make Line Graphs

Most of us have been making line graphs since sixth grade, so here I will discuss how to use a ubiquitous visualization tool, Google Sheets, to have a quick look at our data. The data will have to be small, no more than 40,000 rows, clean and well behaved.

This is a Line Graph showing UK Strawberry imports in 1000 metric tons by year. The data came from UK government National Statistics – Latest horticulture statistics.

It should be really quick to make a graph like this, but a couple of gotchas may leave you searching Google for answers. Google Sheets expects the data to be laid out vertically, for example a column of years with the tons of strawberries in the next column. If your data is laid out horizontally, select the *switch rows/columns* check box. The Strawberry weights on the vertical scale are entered in the *series* field and the year labels on the horizontal scale are entered in the *x-axis* field. The range field is auto-populated from these two values. Finally, select *use row as labels*.
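Switching rows and columns is effectively a transpose of the data. As a rough illustration outside Sheets, here is a minimal Python sketch of the same idea. The years and tonnages are made-up placeholder values, not the real import figures, and `to_vertical` is a hypothetical helper name.

```python
def to_vertical(label_row, value_row):
    """Turn two horizontal rows into a vertical list of (label, value) pairs."""
    if len(label_row) != len(value_row):
        raise ValueError("both rows must have the same number of cells")
    return list(zip(label_row, value_row))

# Placeholder data laid out horizontally (NOT the real import figures).
years = [2015, 2016, 2017, 2018]
tons = [40.0, 42.5, 45.0, 43.5]

# One (year, tons) pair per line: the layout a line graph tool expects.
for year, value in to_vertical(years, tons):
    print(f"{year}\t{value}")
```

Each printed line is one point on the line graph: the first column goes on the x-axis and the second becomes the series.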

## When to use Line Graphs

Use Line Graphs when you want to show value changes over time. You can display several features as they change over time on the same graph. Other continuous variables can be used instead of time.

## What are the strengths and weaknesses of Line Graphs

### Strengths:

- Simple to read and understand
- Show changes and trends over time
- Compare trends in different groups of a variable
- Highlight anomalies within and across data series, which allows erroneous values to be identified

### Weaknesses:

- Can be difficult to plot and compare many features on the same graph
- Poor at displaying quantities of things
- Poor at working with categorical data
- Poor at making part-to-whole comparisons
- Poor at showing sparse data sets

###### Further reading

- https://www.storytellingwithdata.com/blog/2020/3/24/what-is-a-line-graph
- https://www.data-to-viz.com/graph/line.html

# Scatter Plots

## What Scatter Plots show

A Scatter Plot shows the relationship between two numerical variables, one plotted along the horizontal axis and the other along the vertical axis. Scatter Plots are often used to test the correlation between two variables. They show the relationship between two things, but it’s not uncommon to display more than two dimensions, especially when exploring your data.

Scatter Plots are also known as scatter diagrams and X-Y graphs.

## How to make Scatter Plots

The data used for this Scatter Plot comes from the UK Government Department for Environment, Food & Rural Affairs wholesale fruit and vegetable prices. This dataset can be downloaded at no charge. It comprises weekly prices for popular fruit and vegetables going back five years. I took the previous three months and manually extracted the records for Apples.

Because the data was in a vertical list it was easier to get the Scatter Plot to recognise it. The x-axis field was taken from the column of dates and the series field was the price of fruit per kg. The Scatter Plot is a great way to see the shape and distribution of the data. This data appears to have a lot of outliers, whilst most of the data points are clumped at the bottom. This is where understanding your data becomes important. Whilst this data is all for apples, it covers different varieties with some quite different prices, and that causes the distribution you can see.

## When to use Scatter Plots

Scatter Plots are used to show the relationship, or correlation, between two variables where one is dependent and one independent. A Scatter Plot will also show the presence of outliers in the data. Different colours can be used to display more than two variables.

## What are the strengths and weaknesses of Scatter Plots

### Strengths:

- Good at showing data correlation and relationships
- Can illustrate non-linear patterns
- Shows the spread of data including outliers

### Weaknesses:

- Difficult to show individual data point values
- Difficult to show relationship between more than two variables at once

###### Further reading

- http://www.storytellingwithdata.com/blog/2020/5/27/what-is-a-scatterplot
- https://asq.org/quality-resources/scatter-diagram
- https://www.data-to-viz.com/graph/scatter.html

# Box Plots

## What Box Plots show

A Box Plot is used to show the shape of the distribution, its central value, and its variability. It does this by summarising a large amount of data in five values:

- minimum
- first (lower) quartile: the median of the lower half of the data
- median
- third (upper) quartile: the median of the upper half of the data
- maximum

Box Plots also show outliers. Box Plots are also known as Box and Whisker plots.

## How to make Box Plots

To get data for the Box Plots I used the vegetable prices data again, this time for green beans. There is data for three varieties of beans: Broad beans, French beans and Runner beans. I sorted the records in Google Sheets to group the bean data together and then used the Function feature to summarise the data into five metrics. This table shows the Function format for each of the five metrics:

Description | Function example | Location of Function |
---|---|---|
Minimum | =MIN(D1:D29) | Insert => function => MIN |
First (lower) quartile, the median of the lower half of the data | =QUARTILE(D1:D29,1) | Insert => function => Statistical => QUARTILE |
Median | =MEDIAN(D1:D29) | Insert => function => Statistical => MEDIAN |
Third (upper) quartile, the median of the upper half of the data | =QUARTILE(D1:D29,3) | Insert => function => Statistical => QUARTILE |
Maximum | =MAX(D1:D29) | Insert => function => MAX |

So all the data is summarised into just these five values. The data we are using is small compared to data that is usually used for Machine Learning, but even big data will still be summarised down to just these five values. This is what they look like when plotted:

The plot was made with an online box plot plotter.

In turn, the groups represent Broad beans, French beans and Runner beans.
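The five summary values can also be computed outside Sheets. Here is a minimal sketch using only the Python standard library. The prices are made-up placeholder values, not the real bean data, and `five_number_summary` and `outliers` are hypothetical helper names. The 1.5 × IQR rule used to flag outliers is a common convention that I am assuming here; it is not something specified by the dataset or by AWS.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    ordered = sorted(values)
    # method="inclusive" interpolates between data points; to my knowledge
    # this is the same approach the spreadsheet QUARTILE function takes.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def outliers(values):
    """Flag values more than 1.5 x IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

# Placeholder prices per kg (NOT the real bean prices).
prices = [1.10, 1.20, 1.25, 1.30, 1.40, 1.50, 1.60, 1.80, 2.90]

labels = ["minimum", "first quartile", "median", "third quartile", "maximum"]
for label, value in zip(labels, five_number_summary(prices)):
    print(f"{label}: {value:.2f}")
print("outliers:", outliers(prices))
```

However many rows go in, only five numbers (plus any flagged outliers) come out, which is why a Box Plot stays readable for large datasets.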

## When to use Box Plots

Box Plots are used to summarise the distribution of numeric variables. They allow each attribute’s distribution to be reviewed and outliers to be observed.

## What are the strengths and weaknesses of Box Plots

### Strengths:

- Can present a clear summary of a large amount of data.
- Show outliers.

### Weaknesses:

- Exact data point values are not displayed.

###### Further reading

- https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
- https://www.data-to-viz.com/caveat/boxplot.html
- https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm
- https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review

# Histograms

## What Histograms show

Histograms show the distribution of data for one or two variables. Whilst up to four variables can be shown, the graph can become unreadable if more than one or two are shown. The visual shape of the data can be used to describe the data’s distribution, for example:

- Gaussian distribution
- Exponential distribution
- Skewed distribution

The Histogram can also show rounding errors and outliers.

## How to make Histograms

I used the UK horticultural data again and, using Google Sheets, extracted the data for Blueberries. I plotted it with the Sheets Column Chart, with one bar per week. (Sheets also has a separate Histogram chart type that bins raw values for you.) When I first displayed the data the chart was simply full of vertical bars showing some variation between them. The data did not contain records for weeks where no blueberries were harvested, so I added extra zero-value records for the missing weeks. The resulting graph shows that in the UK Blueberries are harvested from late June to November, with no locally grown Blueberries available from December to early June. So we now know when it is a good time to export Blueberries to the UK.

## When to use Histograms

Histograms are used to understand the distribution of numeric data. The data is divided up into bins and we get a total count for each bin. This allows outliers to be identified.
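To make the binning step concrete, here is a minimal sketch using only the Python standard library. The values are made-up placeholders, `histogram_counts` is a hypothetical helper name, and the sketch assumes whole-number values and a whole-number bin width.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many integer values fall into each bin.

    Returns a dict mapping the lower edge of each bin to its count.
    Bins are half-open, [edge, edge + bin_width), and empty bins between
    the smallest and largest value are kept with a count of zero.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    return {edge: counts[edge] for edge in range(first, last + bin_width, bin_width)}

# Placeholder measurements (NOT real data).
weights = [3, 7, 12, 14, 15, 18, 21, 22, 48]

bin_width = 10
for edge, count in histogram_counts(weights, bin_width).items():
    print(f"{edge:>3}-{edge + bin_width - 1:<3} {'#' * count}")
```

Keeping the empty bins matters: the gap before the last bar is what makes the single large value stand out as a possible outlier.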

## What are the strengths and weaknesses of Histograms

### Strengths:

- A histogram allows you to see the frequency distribution of a data set. It offers an “at a glance” picture of a distribution pattern.
- Can show categorical data.

### Weaknesses:

- Because the data is grouped into bins some information can be obscured.
- Shows more than two sets of data poorly.

###### Further reading

- https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
- https://www.data-to-viz.com/graph/histogram.html

# Scatter Matrix

## What Scatter Matrices show

Scatter Matrices show how strongly variables are related to each other and the form of the relationship, if any, between them. They are also known as scatterplot matrices, correlograms and pairplots.

## How to make Scatter Matrices

Below is a scatter matrix made from data about the nutritional content of food from Starbucks, downloaded from Kaggle. Records without data were removed and the first three parameters were used to make the scatter plots below. Each plot is made from a different combination of the three parameters.

Unfortunately there is not much free tool support for producing a Scatter Matrix. There is a paid-for plugin for Microsoft Excel, but this one was produced manually in Sheets by assembling Scatter Plots into a matrix.
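When assembling the matrix by hand it helps to list every pair of parameters that needs its own plot. Here is a minimal sketch using the Python standard library. The column names are hypothetical stand-ins for the first three parameters of the dataset, and `scatter_pairs` is a hypothetical helper name.

```python
from itertools import combinations

def scatter_pairs(names):
    """List each unordered pair of features that needs its own scatter plot."""
    return list(combinations(names, 2))

# Hypothetical column names; substitute the columns of your own dataset.
features = ["Calories", "Fat (g)", "Carbs (g)"]

for x_name, y_name in scatter_pairs(features):
    print(f"plot {y_name} against {x_name}")

# The number of distinct pairs grows quickly with the number of features.
for n in (3, 5, 10):
    print(n, "features ->", len(scatter_pairs(range(n))), "distinct plots")
```

The number of distinct pairs is n(n-1)/2, so three parameters need three plots but ten already need 45, which is why detail becomes hard to see in large matrices.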

## When to use a Scatter Matrix

Use a Scatter Matrix to understand the relationships between three or more features in a model and the correlation between the variables. A Scatter Matrix can be used as part of a dimensionality reduction process.

## What are the strengths and weaknesses of Scatter Matrices

### Strengths:

- Allows a reasonably large number of variables to be compared in the same graph.

### Weaknesses:

- Can be difficult to see detail when ten or more variables are compared.

###### Further reading

- https://medium.com/@raghavan99o/scatter-matrix-covariance-and-correlation-explained-14921741ca56
- https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_scatter_matrix_plot.htm
- https://dzone.com/articles/what-when-amp-how-of-scatterplot-matrix-in-python
- https://www.data-to-viz.com/graph/correlogram.html

# Correlation matrix

## When to use a Correlation Matrix

A Correlation Matrix is used to analyze the relationships between pairs of numeric variables. This allows all variables in a model to be analyzed for correlation with each other. The Correlation Matrix is often used for exploratory analysis of a dataset.

## What a Correlation Matrix shows

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation coefficient between two variables. A correlation coefficient is a number between minus one and one that shows how strongly two variables are linearly related. A value of one indicates complete positive correlation: as one variable changes, the other changes the same way. A value of minus one indicates complete negative correlation: as one variable increases, the other decreases. A value of zero shows there is no linear correlation between the two variables. The diagonal of the table is always one because a variable is always perfectly correlated with itself.

## How to make a Correlation Matrix

The data I used for the Correlation Matrix came from a UK government survey of the nutrient composition of fruit and vegetables, it is free to download. From this data I extracted nutrient information for some common fruit: Bananas, Apples, Green grapes, Red grapes, Strawberries, Blueberries, Oranges.

It is possible to create a Correlation Matrix in Google Sheets. The function CORREL is in the Statistical section of the Function menu. The inputs are two ranges, defined by selecting them from your table data. However, for more than a few attributes this is a very manual and error-prone method. Another way is to use the online tool from DisplayR. This has a free version which is fine for learning how to make different graph types.

In this version of the Correlation Matrix the Correlation Coefficients are displayed as colours. The key is to the right. The main diagonal shows a Correlation Coefficient of 1.0 as dark blue.
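For more than a handful of attributes, a short script is less error prone than entering CORREL cell by cell. Here is a minimal sketch that computes the Pearson correlation coefficient for every pair of columns using only the Python standard library. The nutrient figures are made-up placeholders, not values from the survey, and `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Note: a column where every value is identical has zero variance,
    so this sketch would raise ZeroDivisionError for it.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build {name: {other name: coefficient}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Placeholder nutrient values for five fruits (NOT the survey data).
nutrients = {
    "sugar": [12.0, 10.0, 16.0, 5.0, 9.0],
    "energy": [89.0, 52.0, 69.0, 32.0, 47.0],
    "water": [75.0, 86.0, 81.0, 91.0, 87.0],
}

matrix = correlation_matrix(nutrients)
for name, row in matrix.items():
    print(name.ljust(8), "  ".join(f"{row[other]:+.2f}" for other in matrix))
```

The printed table is symmetric, with ones on the diagonal, and a negative sign marks pairs where one value falls as the other rises.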

## What are the strengths and weaknesses of a Correlation Matrix

### Strengths:

- Good at showing the correlation between pairs of variables

### Weaknesses:

- Does not show plots
- Cannot see outliers or bending patterns

###### Further reading

###### Video – 8.4 – Correlation Matrix from 1 min Statistics

# Heatmaps

## When to use Heatmaps

Heatmaps are used to display a general overview of numeric data.

## What Heatmaps show

A heatmap is a data visualization that uses color the way a bar graph uses height and width to represent values. Individual values are shown by colors. Typically the colors are chosen to range from a light, or warm, color to a dark, cold color. Heatmaps are good at showing a general overview of the data and can allow clustering to be visualized.

## How to make Heatmaps

The graph below shows the prices of Blackberries, Gooseberries, Raspberries and Strawberries in the summer months of July, August and September. The darker the color, the higher the price. You can see that Blackberries command a premium price at the end of the summer whilst Gooseberries are only available from August onwards.

The data was extracted from UK Government Department for Environment, Food & Rural Affairs wholesale fruit and vegetable prices. This dataset can be downloaded at no charge.
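The "darker means higher" mapping can be sketched in a few lines. Here is a minimal, text-only illustration using the Python standard library: the values are min-max normalized to the range 0 to 1 and each one is then mapped to a character, with denser characters standing in for darker colors. The prices are made-up placeholders, not the real wholesale prices, and `SHADES`, `shade` and `text_heatmap` are hypothetical names.

```python
SHADES = ".:+*#"  # lightest to darkest

def shade(level):
    """Pick a character for a normalized level between 0 and 1."""
    index = min(int(level * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]

def text_heatmap(table):
    """Render {row label: [values]} as one string of shade characters per row."""
    flat = [v for row in table.values() for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return {
        label: "".join(shade((v - low) / span) for v in row)
        for label, row in table.items()
    }

# Placeholder prices per kg for July, August, September (NOT the real data).
prices = {
    "Blackberries": [6.0, 7.0, 10.0],
    "Strawberries": [4.0, 4.5, 5.0],
}

for label, cells in text_heatmap(prices).items():
    print(label.ljust(14), cells)
```

Because the normalization uses the minimum and maximum of the whole table, one very high price compresses everything else into the lightest shades, which is one reason the data often needs to be normalized with care.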

## What are the strengths and weaknesses of Heatmaps

### Strengths:

- Good general summary of a numerical value.
- Colour can be used to show when a boundary value has been passed.

### Weaknesses:

- Data often needs to be normalized.
- It can be hard to translate a color into a precise number.
- Not good at displaying specific data points.

###### Further reading

- https://www.data-to-viz.com/graph/heatmap.html
- https://www.displayr.com/how-to-create-a-heatmap-in-displayr/

# Confusion matrix

## When to use a Confusion Matrix

A Confusion Matrix is used to check the performance of a classification model. It is also known as an error matrix.

## What a Confusion Matrix shows

The Confusion Matrix shows the ways in which your classification model is confused when it makes predictions. The degree of confusion can be described with these definitions. This allows different Machine Learning models to be compared.

- Accuracy: Overall, how often is the classifier correct?
- Error Rate: Overall, how often is the classifier wrong?
- True Positive Rate: When it’s actually yes, how often does it predict yes? This is also known as “Sensitivity” or “Recall”
- False Positive Rate: When it’s actually no, how often does it predict yes?
- True Negative Rate: When it’s actually no, how often does it predict no? This is also known as “Specificity”
- Precision: When it predicts yes, how often is it correct?

## How to make a Confusion Matrix

I used results from an IJEAT paper on identifying plant diseases of Citrus fruits using Machine learning of image data. This is the Confusion Matrix for Canker:

| | Predicted NO | Predicted YES |
|---|---|---|
| Actual NO | True Negative = 70 | False Positive = 2 |
| Actual YES | False Negative = 1 | True Positive = 77 |

https://www.ijeat.org/wp-content/uploads/papers/v9i2/B4066129219.pdf

The Confusion Matrix compares what the ML algorithm detected against the actual condition of the fruit. We can now put numeric values to the definitions listed above.

The total number of fruits examined is the total of the numbers in the matrix:

Total = 70 + 2 + 1 + 77 = 150

**Accuracy**: (True Positive + True Negative)/Total

( 70 + 77 ) / 150 = 0.98, or 98%

**Error Rate**: ( False Positive + False Negative ) / Total

( 2 + 1 ) / 150 = 0.02, or 2%

**True Positive Rate (Recall)**: True Positive / Actual Yes, also: True Positive / ( False Negative + True Positive )

77 / ( 1 + 77 ) = 0.9872, or 98.72%

**False Positive Rate**: False Positive / Actual No, also: False Positive / ( False Positive + True Negative )

2 / ( 2 + 70 ) = 0.0278, or 2.78%

**True Negative Rate**: True Negative / Actual No, also: True Negative / ( True Negative + False Positive )

70 / ( 70 + 2 ) = 0.9722, or 97.22%

**Precision**: True Positive / Predicted Yes, also: True Positive / ( True Positive + False Positive )

77 / ( 77 + 2 ) = 0.9747, or 97.47%
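The same arithmetic can be checked with a short script. Here is a minimal sketch in plain Python that takes the four counts from the Canker matrix above; `confusion_metrics` is a hypothetical helper name and the dictionary keys are my own labels for the definitions listed earlier.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute the standard rates from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),   # sensitivity / recall
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),   # specificity
        "precision": tp / (tp + fp),
    }

# Counts taken from the Canker confusion matrix.
metrics = confusion_metrics(tn=70, fp=2, fn=1, tp=77)

for name, value in metrics.items():
    print(f"{name}: {value:.4f} ({value:.2%})")
```

Accuracy and error rate always sum to one, as do the false positive rate and the true negative rate, which is a quick sanity check on any hand calculation.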

## What are the strengths and weaknesses of a Confusion Matrix

### Strengths:

- Enables the evaluation of classification models.

### Weaknesses:

- Summary figures such as classification accuracy have limitations and can hide important details.
- Need to know true values for comparison.

###### Further reading

###### Video – Making sense of the confusion matrix

This video by Kevin Markham is based on his post: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/. The video is 35 minutes long, with the first 16 minutes being the most relevant to the AWS Machine Learning exam. Here are the timestamps in case you want to skip forwards:

- 0 – Simple guide to the Confusion Matrix
- 2.22 – Layout of Confusion Matrix
- 5.25 – True Positive
- 5.45 – True Negative
- 6.00 – False Positive
- 7.00 – False Negative
- 7.54 – Three points to remember about Confusion Matrix
  - The values are whole number counts
  - The terms only make sense when you have a positive class
  - If you have more than two classes the terminology should not be used because it is confusing!
- 10.30 – A confusion matrix with row and column totals and terminology
- 11.25 – Accuracy
- 12.00 – Misclassification / Error rate
- 13.20 – True Positive rate, also called Sensitivity or Recall
- 14.25 – False Positive rate
- 15.00 – True Negative rate, also called Specificity
- 16.06 – Precision
- 17.18 – ROC curve
- 19.00 – Answers to questions posted on the web page

# When to use which graph?

The selection of graph depends on what you want to find out or show. Here are some typical areas of analysis, numbered to match the columns of the table below:

1. Comparing values
2. Showing the composition of something
3. Understanding the distribution of the data
4. Showing trends
5. Showing the relationships between variables

Graph | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Line graphs | X | 1 | X | X | X |
Scatter plots | X | X | | | |
Box plots | X | | | | |
Histograms | X | 1 | X | X | |
Scatter matrix | X | X | | | |
Correlation matrix | X | | | | |
Heatmaps | X | X | X | | |
Confusion matrix | X | X | | | |

1 – Stacked line graphs and stacked bar charts can show composition. These types of graphs are not discussed in these revision notes.

###### Further reading

When to use the different types of graphs:

- https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization
- https://www.tableau.com/learn/whitepapers/which-chart-or-graph-is-right-for-you
- good graphics: https://www.datapine.com/blog/how-to-choose-the-right-data-visualization-types/
- https://activewizards.com/blog/how-to-choose-the-right-chart-type-infographic/
- https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f

# Summary

These revision notes have discussed eight visualizations used to analyze data to prepare it for Machine Learning. The key features for each visualization are:

- When to use
- What it shows
- How to make
- Strengths and weaknesses
- Further reading

At the end there is a table linking the visualizations to typical questions asked when preparing data for Machine Learning.

## Data sources

- UK government National Statistics – Latest horticulture statistics. https://www.gov.uk/government/statistics/latest-horticulture-statistics
- UK Government Department for Environment, Food & Rural Affairs wholesale fruit and vegetable prices. https://www.gov.uk/government/statistical-data-sets/wholesale-fruit-and-vegetable-prices-weekly-average
- Nutritional information for food from Starbucks, downloaded from Kaggle: https://www.kaggle.com/starbucks/starbucks-menu.
- Plant diseases https://www.ijeat.org/wp-content/uploads/papers/v9i2/B4066129219.pdf

## Credits

- Feature photo by Dmitry Ratushny on Unsplash
- Strawberry photo by Jacek Dylag on Unsplash
- Apples photo by Stepan Babanin on Unsplash
- Blueberries photo by Joanna Kosinska on Unsplash
- Beans photo by Eniko Torneby on Unsplash
- Coffee and food photo by 🇨🇭 Claudio Schwarz | @purzlbaum on Unsplash
- Fruit in a market photo by Yuval Yehudar on Unsplash
- Cut strawberries, blackberries and raspberries in a bowl photo by Brian McCall on Unsplash
- Orange tree photo by Philippe Gauthier on Unsplash

## Notes

- The graphs discussed in these revision notes are stated in the Exam Readiness: AWS Certified Machine Learning – Specialty course.
- Density graphs are mentioned on the AWS ML pipeline course.


#### 10 questions and answers
