# Feature Engineering for Machine Learning

Look at the photograph above. It shows vegetables in a grocery store. We can think of these vegetables as being features of the grocery store. There are hundreds of vegetables. If we listed all the vegetables one by one it would be a very long list. How could we shorten the list? One way would be to group similar vegetables together and express them by their number and vegetable name, for example sixty courgettes, nine cabbages. Another way would be to weigh the vegetables, for example twenty pounds of aubergines, six pounds of cauliflowers. Alternatively you could categorise them by colour and the amount of visible surface area covered in square feet. All of these methods are examples of Feature Engineering. They are ways to make something very large and complicated easier to understand and process.

Feature Engineering is sub-domain 2.2 of the Exploratory Data Analysis knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus


## Feature Engineering overview

In Machine Learning a feature is an individual measurable property of what is being explored. Feature Engineering is the process of creating new features from the original ones to increase the predictive power of the chosen algorithm. The overall purpose of Feature Engineering is to show more information about our data. To do this Feature Engineering seeks to:

- Prepare a dataset to be compatible with the requirements of the Machine Learning algorithm.
- Improve the performance of the chosen Machine Learning models.

Feature Engineering consists of three processes, two involving Dimensionality Reduction and one for Feature Creation and Transformation. The processes are:

- Feature Extraction (Dimensionality Reduction)
- Feature Selection (Dimensionality Reduction)
- Feature Creation and Transformation

## Dimensionality Reduction

Dimensionality Reduction is concerned with reducing the number of features, for example by reducing a feature set from fifty to twenty columns. Dimensionality Reduction techniques are used to reduce the dimension of the feature set, so that the new feature set retains some meaningful properties of the original data. There are two ways to perform Dimensionality Reduction:

- Feature Selection
- Feature Extraction

Feature Selection reduces the feature space by removing features. The problem is that by removing features you remove the opportunity to benefit from that information.

Feature Extraction increases the accuracy of learned models by extracting features from the input data. This reduces the dimensionality of data by removing the redundant data. An example is Principal Component Analysis.

- Dimensionality reduction
- A beginner’s guide to dimensionality reduction in Machine Learning
- Introduction to t-SNE
- Advantages and Disadvantages of t-SNE over PCA (PCA vs t-SNE)

### The Curse of Dimensionality

The *Curse of Dimensionality* occurs because as the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for Machine Learning models and causes them to underperform. To combat this you need larger training datasets. As the number of features increases, the model also becomes more complex. The more features, the greater the chance of overfitting. The main consequences are:

- More training data – more records of data are needed to ensure that every combination of features is represented in the training dataset.
- Overfitting – The more features a model is trained on, the more complex our model becomes and the more we risk poor assumptions and potentially fitting to outliers.
- Longer training times – Larger input datasets mean more computational complexity during model training and longer training times.
- Data storage – Larger datasets require more storage space and may become difficult and time consuming to move around.
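The sparsity argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each feature is split into ten equal bins and that the training set has a fixed, hypothetical size of 1,000 records, then counts how many records are available per cell as features are added.

```python
def grid_cells(bins_per_feature, n_features):
    """Number of distinct cells when every feature is split into equal bins."""
    return bins_per_feature ** n_features


n_samples = 1000  # hypothetical, fixed training set size

for n_features in (1, 2, 3, 6):
    cells = grid_cells(10, n_features)
    # Average number of records available to describe each cell
    print(n_features, cells, n_samples / cells)
```

With one feature there are 100 records per cell on average. With six features there is one record for every thousand cells, so most combinations of feature values are never seen during training.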

- https://en.wikipedia.org/wiki/Curse_of_dimensionality
- https://blog.dataiku.com/dimensionality-reduction-how-it-works-in-plain-english

### The advantages of Dimensionality Reduction

The advantages of Dimensionality Reduction are:

- Less misleading data means model accuracy improves.
- Fewer dimensions mean less computing. Less data means that algorithms train faster.
- Less data means less storage space is required.
- Fewer dimensions allow the use of algorithms that are unfit for a large number of dimensions.
- Redundant features and noise are removed.

### Feature Extraction

By creating new features from existing features, Feature Extraction reduces the dimensionality of your data set. Some data has so many features that it is difficult to process, for example in Natural Language Processing and image processing. The number of features can be reduced by removing repeated or redundant features. Another technique is to cluster features and keep some that are representative of many others. This helps the Machine Learning algorithms to process the data with less effort and increases the speed of the learning and generalisation steps in the machine learning process.

Two automated processes are Principal Component Analysis and t-distributed Stochastic Neighbour Embedding.

#### Principal Component Analysis

- https://en.wikipedia.org/wiki/Principal_component_analysis
- https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
- Best explanation without going deep into the mathematics: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
- Simple video explanation: https://www.youtube.com/watch?v=HMOI_lkzW08
- Principal Component Analysis explained visually

##### When to use PCA

Principal Component Analysis is a technique for feature extraction. You should use it when you want to reduce the number of features.

##### What are the advantages of PCA

PCA reduces the volume of data that has to be processed, which can increase the accuracy of Machine Learning models. With less data to be processed, the models complete their work in less time.

##### What are the weaknesses of PCA

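PCA is most useful when the input variables are correlated with each other, because it is the shared, redundant information that gets removed. PCA makes the independent variables less interpretable, because each component is a mixture of the original features. PCA only captures linear relationships in the data.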

##### What does PCA do

Principal Component Analysis is a technique for feature extraction. Feature extraction increases the accuracy of Machine Learning models by extracting features from the input data. This reduces the dimensionality of data by removing the redundant data.

The input features are combined in a specific way to create new features. This allows us to drop the least important features while still retaining the most valuable parts of all of the features. Each of the new features, or components, created by PCA is uncorrelated with the others.
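To show what "combining the input features" means, here is a minimal sketch in plain Python. It finds only the first principal component, using power iteration on the covariance matrix, which is one of several ways to compute it and is chosen here for brevity. The function name and the two-column data set are hypothetical; in practice a library implementation would normally be used.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges on the direction
    # of greatest variance (the dominant eigenvector of the covariance matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centred row onto that direction: one new feature per row
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


# Hypothetical data: the second column is roughly twice the first,
# so the two columns carry almost the same information
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

The two correlated columns are replaced by a single column of scores, which is the reduction from two features to one.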

###### Video – StatQuest: PCA main ideas in only 5 minutes!!!

This is a 6 minute video by Josh Starmer.

- 0:00 Awesome song and introduction
- 0:27 Motivation for using PCA
- 1:23 Correlations among samples
- 3:36 PCA converts correlations into a 2-D graph
- 4:26 Interpreting PCA plots
- 5:08 Other options for dimension reduction

#### t-distributed Stochastic Neighbour Embedding

- https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
- https://www.datacamp.com/community/tutorials/introduction-t-sne
- https://www.displayr.com/using-t-sne-to-visualize-data-before-prediction/

##### When to use t-SNE

For data visualisation.

##### What are the advantages of t-SNE

The main advantage of t-SNE is the ability to preserve local structure. This means, roughly, that points which are close to one another in the high-dimensional data set will tend to be close to one another in the chart. t-SNE works well on non-linear data.

##### What are the weaknesses of t-SNE

The input features are no longer identifiable, and you cannot make any inference based only on the output of t-SNE. It is computationally heavy, slow and resource draining when applied to datasets comprising more than 10,000 observations.

##### What does t-SNE do

t-distributed Stochastic Neighbour Embedding (t-SNE) minimises the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding. In this way, t-SNE maps the multi-dimensional data to a lower dimensional space and attempts to find patterns in the data by identifying observed clusters based on similarity of data points with multiple features.
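A full t-SNE implementation is too long to show here, so the sketch below covers only the quantity being minimised: the Kullback-Leibler divergence between two similarity distributions. The three-value distributions are hypothetical stand-ins for the pairwise similarities in the original data and in two candidate embeddings.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence of q from p, for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical pairwise similarities in the high-dimensional data
high_dim = [0.5, 0.3, 0.2]
# Two hypothetical low-dimensional embeddings of the same points
good_embedding = [0.48, 0.32, 0.20]
poor_embedding = [0.2, 0.3, 0.5]

print(kl_divergence(high_dim, good_embedding))
print(kl_divergence(high_dim, poor_embedding))
```

The embedding whose similarities are closer to the original ones gives the smaller divergence, and that is the direction in which t-SNE moves the low-dimensional points.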

###### Video – StatQuest: t-SNE, Clearly Explained

This video by Josh Starmer is 11 minutes 47 seconds long. It provides a clear explanation of t-SNE.

- 0:00 Awesome song and introduction
- 1:19 Overview of what t-SNE does
- 2:24 Overview of how t-SNE works
- 4:12 Step 1: Determine high-dimensional similarities
- 9:26 Step 2: Determine low-dimensional similarities
- 10:33 Step 3: Move points in low-d
- 11:05 Why the t-distribution is used instead of the normal distribution

### Feature Selection

Feature Selection is a type of Dimensionality Reduction. Features are removed if they are not as relevant as others to what is being predicted. Unlike Feature Extraction no new features are created. Feature selection algorithms rank features depending on their importance to the prediction and remove the lower ranked features. Feature selection algorithms can be classified into types, one of which is Filtering.

#### Filtering

By applying a score to each feature using a statistical method, a Filter Selection method can rank each feature. The ranking can be based only on an individual feature or assess a feature in relation to a selected variable. The ranked score determines if the feature will be retained or filtered out.
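As an illustration, the sketch below scores each feature by the absolute value of its Pearson correlation with the target and keeps the highest ranked ones. Correlation is only one possible scoring statistic, and the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_features(features, target, keep):
    """Rank features by |correlation with target| and keep the top ones."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Hypothetical vegetable data with price as the value being predicted
features = {
    "weight_kg": [1.0, 2.0, 3.0, 4.0, 5.0],
    "shelf_number": [3.0, 1.0, 4.0, 1.0, 5.0],
    "length_cm": [10.0, 19.0, 31.0, 42.0, 48.0],
}
price = [2.1, 3.9, 6.2, 8.1, 9.8]

selected, scores = filter_features(features, price, keep=2)
print(selected)
print(scores)
```

No new features are created: the weakly related `shelf_number` column is simply dropped.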

## Feature Creation and Transformation

Feature Creation and Transformation is the opposite of Dimensionality Reduction because new features are created and the total feature count increases. Some features provide little information by themselves that can be interpreted by a Machine Learning algorithm. By changing them or creating new features, more useful information can be exposed.

### Categorical data

Categorical encoding is the process of changing features to numeric data. We do this because many of the Machine Learning models will only process numeric data. We will look at two ways to encode categorical data:

- Label encoding, a type of mapping
- One-hot encoding, binary representation for features

The strategy adopted will depend on whether the categories are ordinal or nominal. Ordinal categories have an order of size or importance. Nominal categories have members that are different, but do not have an order.

#### Label encoding

Label encoding, also called ordinal encoding, exchanges a numerical value for the text data in a field. For example vegetable data may describe Red Peppers (Capsicum), Courgettes, Aubergines (Eggplant) and Broccoli. A numeral could be assigned to each one:

Name | Categorical value |
---|---|
Red Pepper | 1 |
Courgette | 2 |
Aubergine | 3 |
Broccoli | 4 |

Weakness: The problem with this approach arises if the categories are compared. Is a Courgette worth two Red Peppers? Is a Broccoli twice as big as a Courgette?
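A minimal sketch of label encoding, assuming numbers are assigned in the order in which each name is first seen. The list of vegetables is hypothetical sample data.

```python
# Hypothetical column of vegetable names, with one repeated value
vegetables = ["Red Pepper", "Courgette", "Aubergine", "Broccoli", "Courgette"]

# Assign the next unused number to each name the first time it is seen
mapping = {}
for name in vegetables:
    if name not in mapping:
        mapping[name] = len(mapping) + 1

encoded = [mapping[name] for name in vegetables]
print(mapping)
print(encoded)
```

The numbers are arbitrary labels. Any arithmetic on them, such as averaging, would be meaningless for nominal data like this.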

#### One-hot encoding

Encoding categorical features, i.e. using a number to represent the feature, can cause problems if they are compared because it may imply that one is larger than another. One-hot encoding provides a numeric representation of a feature that does not imply a size difference. Binary values are assigned to each category. The zeros and ones form binary variables which show the presence or absence of a category. For example this table shows features of vegetables:

Vegetable name | Green | Black | Red |
---|---|---|---|
Broccoli | 1 | 0 | 0 |
Aubergine | 0 | 1 | 0 |
Pepper | 0 | 0 | 1 |
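The table above can be produced with a few lines of code. This is a minimal sketch: the `one_hot` helper is a hypothetical name, and it creates one column per distinct value in the order the values first appear.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per category."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Colour of each vegetable, in the same order as the table
colours = ["Green", "Black", "Red"]
categories, rows = one_hot(colours)
print(categories)
for row in rows:
    print(row)
```

Every row contains exactly one 1, so no category appears larger or smaller than another.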

Occasionally there will be two categories that can be predicted from each other, for example the categories “grows above the ground” and “grows below the ground”. A vegetable can either grow above ground, or below the ground, not both. So if there is a zero in the “grows above the ground” category then it can be predicted that there will be a one in the “grows below the ground” category. A better way to describe this is to have a single “growth location” category where zero represents “grows below the ground” and one represents “grows above the ground”.

##### Grouping

The weakness with one-hot encoding is that it can require many categories and make data too large. The solution is grouping, so that one group category can represent many nominal values. You could also have an “others” category for rare values.
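A minimal sketch of the "others" idea, assuming a value counts as rare when it appears fewer than `min_count` times. The threshold, the helper name and the sample values are all hypothetical.

```python
from collections import Counter


def group_rare(values, min_count):
    """Replace values seen fewer than min_count times with 'other'."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else "other"
            for value in values]


values = ["carrot", "carrot", "potato", "potato", "potato",
          "salsify", "celeriac"]
grouped = group_rare(values, min_count=2)
print(grouped)
```

One-hot encoding the grouped column needs three columns instead of four, and the saving grows with the number of rare values.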

##### When to use

Most Machine Learning algorithms need numerical input. So if you have categorical data with labels you need to convert it to numerals. If the labels have a natural order in relation to each other use label encoding, i.e. map each label to a number. If the labels have no order in relation to each other use one-hot encoding. This can be summarised as:

- label encoding: ordinal, ordered labels with a rank relationship to each other
- one-hot encoding: nominal labels, un-ordered labels with no rank relationship to each other

### Numerical data

Many Machine Learning algorithms process numerical values. However problems can occur if the numbers of different features are of dramatically different sizes. This may cause the algorithm to give extra weight to the large numbered feature and affect the results. To avoid this situation numerical values can be changed to new values that maintain the general distribution and ratios in the source data.

Techniques that can be used are:

- Logarithmic transformation
- Square root
- Cube root
- Binning
- Scaling

#### Logarithmic transformation

Logarithmic transformation is used to change the shape of the distribution. It can be used to reduce skewness. It can only be used with positive non-zero numeric values.

#### Square root

The square root moderately affects the shape of the distribution, although not as much as the logarithmic transformation. It can only be used with zero and positive numeric values.

#### Cube root

The cube root has a strong effect on the shape of the distribution. Unlike the other two transformations, it can also be used with negative and zero numeric values.
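The three transformations can be sketched with the standard `math` module. The sample values are hypothetical, and the comments note which inputs each function accepts.

```python
import math


def log_transform(values):
    # math.log raises ValueError for zero or negative input
    return [math.log(v) for v in values]


def sqrt_transform(values):
    # math.sqrt accepts zero but raises ValueError for negative input
    return [math.sqrt(v) for v in values]


def cube_root_transform(values):
    # Take the root of the magnitude, then restore the sign,
    # so that negative and zero values are handled
    return [math.copysign(abs(v) ** (1.0 / 3.0), v) for v in values]


skewed = [1.0, 10.0, 100.0, 1000.0]  # hypothetical right-skewed feature
print(log_transform(skewed))
print(sqrt_transform(skewed))
print(cube_root_transform([-8.0, 0.0, 27.0]))
```

On the skewed sample the largest value is 1,000 times the smallest before transformation. Comparing the printed lists shows how much each transformation pulls the large values in.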

#### Binning

##### When to use Binning?

When you want to convert numeric data into categorical data.

##### What are the advantages of Binning?

Binning helps to prevent overfitting.

##### What are the weaknesses of Binning?

Binning can lead to an uneven distribution of values across the bins, which is overcome by quantile binning. There is a loss of information. There is a loss of performance.

##### What does Binning do?

Binning groups values together into bins. Quantile binning groups values into bins so that each bin holds the same number of data points. The number of bins depends on the characteristics of the variable and its relationship to the target. The optimal number of bins is determined by experimentation.

The purpose of binning is to make the model more robust and prevent overfitting, however, it has a cost to the performance. Every time you bin something, you sacrifice information and make your data more regularised. The trade-off between performance and overfitting is the key point of the binning process.
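The difference between plain (equal-width) binning and quantile binning shows up clearly on skewed data. This is a minimal sketch with hypothetical helper names and sample values. It assumes the values are not all identical, and the quantile version assigns bins by rank, so tied values could land in different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is placed in the last bin, not a bin of its own
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def quantile_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


# Hypothetical skewed feature: most values are small, two are large
weights = [1, 2, 3, 4, 5, 6, 7, 8, 50, 100]
print(equal_width_bins(weights, 2))
print(quantile_bins(weights, 2))
```

Equal-width binning puts nine of the ten values into the first bin, while quantile binning puts five into each.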

#### Scaling

- https://www.datavedas.com/feature-scaling/
- https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79
- https://medium.com/@sjacks/feature-transformation-21282d1a3215
- https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

Feature scaling methods fall into two broad groups: Normalisation and Standardisation. Each has its own advantages and disadvantages. The method used will depend on the algorithm being used in the model. Many machine learning algorithms require Feature Scaling as this prevents the model from giving greater weighting to some attributes as compared to others. This requirement is due to the algorithms considering all the features together, rather than as separate islands of information. So if features have wildly different units, the comparison will be unfairly skewed by features with big numbers.

Five popular methods of scaling data are:

- Standardising
- Min-max normalisation
- Maxabs scaling
- Robust scaling
- Mean Normalisation

##### Standardising

Standardising is also called mean/variance standardisation, z-score normalisation.

###### When to use Standardisation?

Standardisation is used when we have data that has a variety of features with different measurement scales or large differences between their ranges. Some models require Standardisation of features, such as clustering algorithms and models that rely on the distribution of features, such as Gaussian processes.

Standardisation assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardisation is useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution.

###### What are the advantages of Standardisation?

Many machine learning algorithms require feature scaling as this prevents the model from giving greater weighting to certain attributes as compared to others.

###### What are the weaknesses of Standardisation

Outliers affect the outcome, but not as much as with min-max normalisation.

###### What does Standardisation do

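Standardisation takes a list of values, subtracts the mean from each one and divides the result by the standard deviation. The average value becomes zero and every other value becomes its z-score, the number of standard deviations it lies above or below the mean.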
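A minimal sketch of the z-score calculation using the standard `statistics` module. It assumes the population standard deviation is wanted and that the values are not all identical. The sample values are hypothetical.

```python
import statistics


def standardise(values):
    """Convert each value to its z-score: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
z_scores = standardise(values)
print(z_scores)
```

After standardising, the list has a mean of zero and a standard deviation of one, whatever units the original values were in.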

##### Min-Max Normalisation

Also known as min-max scaling, rescaling or normalisation.


###### When to use Min-Max Normalisation

When to use min-max normalisation depends on the specific Machine Learning algorithm that the data is being prepared for. It can be used for k-nearest neighbour where distances are to be calculated or regression where coefficients are to be prepared. However standardisation is more often used. Min-max normalisation is good for image based classification models and Neural Networks.

###### What are the advantages of Min-Max Normalisation

Some algorithms require all input features to be normalised. Min-max normalisation places every feature on the same bounded scale, which gives smaller standard deviations in the output.

###### What are the weaknesses of Min-Max Normalisation

MinMax Normalisation is sensitive to outliers. It is highly influenced by the maximum and minimum values in our data so if our data contains outliers it is going to be biased. Also the min-max normalisation may compress all inliers in a narrow range.

###### What does Min-Max Normalisation do

For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1. An alternative scale from -1 to 1 could also be used. Min-max normalisation does not change the shape of the distribution and does not correct skewness.
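A minimal sketch of min-max normalisation. The `new_min` and `new_max` parameters are hypothetical names added to show the alternative -1 to 1 scale, and the function assumes the values are not all identical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


values = [10.0, 20.0, 30.0, 50.0]  # hypothetical feature
print(min_max(values))
print(min_max(values, new_min=-1.0, new_max=1.0))
```

Because the minimum and maximum set the scale, a single extreme value would squeeze all the other values into a narrow part of the range.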

##### Maxabs scaling

Also called maximum absolute value scaling.

###### When to use Maxabs scaling

When you want to preserve sparsity and the center of the distribution on data known to not have outliers.

###### What are the advantages of Maxabs scaling

Doesn’t change distribution’s center and does not destroy any sparsity

###### What are the weaknesses of Maxabs scaling

Sensitive to outliers

###### What does Maxabs scaling do

In simplest terms, MaxAbs scaling divides each value in a column by the maximum absolute value of that column. To find it, take the absolute value of each value in the column and then take the largest of those. This operation scales the data into the range [-1, 1].
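A minimal sketch of MaxAbs scaling, assuming the column contains at least one non-zero value. The sample values are hypothetical.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the list."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


values = [-4.0, 0.0, 2.0, 8.0]  # hypothetical feature
scaled = max_abs_scale(values)
print(scaled)
```

Nothing is subtracted, so zeros stay zero and signs are kept. This is why the method preserves sparsity and the centre of the distribution.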

##### Robust Scaling

Also known as robust standardisation, robust data scaling or scaling with robust measures of scale.

###### When to use Robust Scaling

When you have outliers that you want handled better than other scaling methods handle them.

###### What are the advantages of Robust Scaling

Use of quartile ranges makes this less sensitive to (a few) outliers. So Robust scaling is not greatly influenced by outliers.

###### What are the weaknesses of Robust Scaling

Robust Scaling has inferior statistical efficiency when the data does not have outliers.

###### What does Robust Scaling do

Robust scaling centres and scales the variable using statistics that outliers have little effect on, instead of the mean and standard deviation.

This can be achieved by calculating the median (50th percentile) and the 25th and 75th percentiles. The values of each variable then have their median subtracted and are divided by the interquartile range which is the difference between the 75th and 25th percentiles.

The resulting variable has a median of zero and an interquartile range of one. This is not skewed by outliers, which are still present with the same relative relationships to other values.
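A minimal sketch of the median and interquartile range calculation using `statistics.quantiles`. The `method="inclusive"` setting is an assumption about how the percentiles should be interpolated, since other conventions give slightly different quartiles. The sample values are hypothetical.

```python
import statistics


def robust_scale(values):
    """Subtract the median, then divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical feature with one large outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
scaled = robust_scale(values)
print(scaled)
```

The outlier is still present and still far from the rest, but it played no part in choosing the centre or the scale, so the other eight values keep a sensible spread.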

##### Mean Normalisation

###### When to use Mean Normalisation

Normalisation is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalisation is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbours and artificial neural networks.

###### What are the advantages of Mean Normalisation

Mean Normalisation can work with non-Gaussian distributed data.

###### What are the weaknesses of Mean Normalisation

Outliers can significantly distort the outcome.

###### What does Mean Normalisation do

Mean normalisation subtracts the mean of the feature from each value and divides the result by the range, which is the maximum value minus the minimum value. The result has a mean of zero and all values lie between -1 and 1. Unlike standardisation, it does not rescale the features to have the properties of a standard normal distribution.
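A minimal sketch, assuming the usual definition of mean normalisation: each value minus the mean, divided by the range (maximum minus minimum). Under this definition the values are centred on zero rather than mapped onto 0 to 1. The sample values are hypothetical and must not all be identical.

```python
def mean_normalise(values):
    """Subtract the mean, then divide by the range (max - min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature
normalised = mean_normalise(values)
print(normalised)
```

As with min-max normalisation, the range in the denominator is set by the two most extreme values, which is why outliers distort the result.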

### Date and time data

Dates can be a rich source of information. Whilst they can be used to identify when a specific event occurred, they can also tell us more about the context of the event. The date can be used to tell you if it was:

- a weekday, or weekend
- business hours or out of business hours
- a financial period such as a pay day, end of quarter, end of financial year, the season
- a national holiday
- a festival day, for example Christmas, or Eid
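Several of these contextual features can be derived with the standard `datetime` module. This is a sketch under stated assumptions: business hours are taken to be 09:00 to 17:00 on Monday to Friday, and quarters follow the calendar year. Both are hypothetical choices that would depend on the business concerned.

```python
from datetime import datetime


def date_features(moment):
    """Derive simple contextual features from a datetime."""
    is_weekend = moment.weekday() >= 5  # Monday is 0, Sunday is 6
    return {
        "is_weekend": is_weekend,
        # Assumed business hours: 09:00 to 17:00 on weekdays
        "is_business_hours": (not is_weekend) and 9 <= moment.hour < 17,
        # Assumed calendar-year quarters: January to March is quarter 1
        "quarter": (moment.month - 1) // 3 + 1,
    }


print(date_features(datetime(2021, 12, 25, 10, 30)))
print(date_features(datetime(2021, 12, 27, 10, 30)))
```

National holidays and festival days cannot be calculated this way and would need a lookup table for the country concerned.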

### Reformat to usable form

Dates usually arrive in string format in a dataset. This is not that useful because you cannot compare or infer anything from a string. It is just a label. There are three methods to convert a date into a usable form:

- Split the string up into columns, with each column representing a component of the date
- Convert into a timestamp
- Convert into an epoch date.

#### Split into columns

Splitting up the date, or date time can be achieved using string manipulation or regular expressions. Each field holds a text or numerical value for a component of the date, for example day, month, year, time.
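A minimal sketch of the regular expression approach. It assumes the dates arrive as `YYYY-MM-DD HH:MM:SS` strings, which is a hypothetical format; other formats need a different pattern.

```python
import re

# Assumed input format: "YYYY-MM-DD HH:MM:SS"
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})")
COLUMNS = ("year", "month", "day", "hour", "minute", "second")


def split_date(text):
    """Split a date-time string into a dict of numeric components."""
    match = DATE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised date format: {text!r}")
    return dict(zip(COLUMNS, (int(part) for part in match.groups())))


print(split_date("2021-12-25 10:30:00"))
```

Each key in the returned dictionary becomes a separate numeric column in the dataset.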

#### Timestamp

This can be an internal construct or programming object that exposes methods to extract any aspect of the date time.

#### Epoch date

This is simply a number representing how much time has elapsed from a start date. This makes comparison between dates a simple mathematical process. An example of an epoch date is the Unix epoch date which is 1st January 1970, with all dates being a number in seconds measured from that date.
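In Python the timestamp and epoch approaches are closely related: a `datetime` object plays the part of the timestamp, and its `timestamp()` method returns Unix epoch seconds. The sketch below assumes the input strings are in `YYYY-MM-DD HH:MM:SS` format and represent UTC times. Both are hypothetical assumptions about the dataset.

```python
from datetime import datetime, timezone


def to_timestamp(text):
    """Parse an assumed 'YYYY-MM-DD HH:MM:SS' string as a UTC datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def to_epoch_seconds(text):
    """Seconds elapsed since the Unix epoch, 1970-01-01 00:00:00 UTC."""
    return int(to_timestamp(text).timestamp())


order_placed = "2021-12-25 10:30:00"
order_shipped = "2021-12-27 10:30:00"

# The timestamp object exposes each component of the date
print(to_timestamp(order_placed).year, to_timestamp(order_placed).weekday())

# Epoch values turn date comparison into simple subtraction
elapsed = to_epoch_seconds(order_shipped) - to_epoch_seconds(order_placed)
print(elapsed)
```

Setting the time zone explicitly avoids results that depend on the local settings of the machine running the code.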

## Summary

These revision notes have covered the three processes of Feature Engineering: Feature Selection; Feature Extraction; and Feature Creation and Transformation. The first two involve Dimensionality Reduction, while the last one increases the number of features. PCA and t-SNE can help to reduce the number of features. Different feature creation and transformation techniques are used for categorical and numerical data. Label encoding and one-hot encoding are used for categorical data. Numerical data can be transformed using mathematical transformations, binning and scaling.

##### Concepts

- Feature engineering
- Dimensionality reduction
- The curse of dimensionality
- Feature extraction
- Feature selection
- Feature creation and transformation

##### Techniques

- PCA
- t-SNE
- Filtering
- Label encoding
- One-hot encoding
- Grouping
- Mathematical transformations: log, square root, cube root
- Binning
- Scaling: Standardizing, Min-max, maxabs, robust, Mean normalization
- Date and time data

- https://en.wikipedia.org/wiki/Feature_engineering
- https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
- https://adataanalyst.com/machine-learning/comprehensive-guide-feature-engineering/

###### Credits

- Photo by nrd on Unsplash
- Infographics by Michael Stainsbury using templates from Canva
- Icon images in infographics:
- Edit Tool by Vichanon Chaimsuk from the Noun Project
- teeth by Adrien Coquet from the Noun Project
- choose by priyanka from the Noun Project

##### Notes

These aspects of Feature Engineering are mentioned here, but not expanded upon because they were not stated in the AWS Exam Readiness Course.

- Text Feature Engineering
- Image Feature Engineering
- Audio Feature Engineering

