[Photo: a burning book held in a hand, symbolizing the SageMaker built-in algorithm BlazingText]

BlazingText Algorithm

BlazingText is AWS’s SageMaker built-in algorithm for identifying relationships between words in text documents. These relationships, also called embeddings, are expressed as vectors. The vectors preserve the semantic relationships between words, clustering words with similar meanings together. This conversion of words into meaningful numeric vectors is very useful for Natural Language Processing (NLP), which requires input data in vector format. This is why BlazingText is used as a precursor to Natural Language Processing.
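To illustrate what clustering words with similar semantics means in practice, here is a minimal sketch, using plain NumPy rather than BlazingText itself and entirely made-up vector values, of how cosine similarity over embedding vectors captures semantic closeness:

```python
# A sketch of semantic similarity between word embeddings.
# The 4-dimensional vectors below are made-up illustrative values;
# real models use vectors with hundreds of dimensions.
import numpy as np

def cosine_similarity(a, b):
    """Measure how closely two word vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = np.array([0.80, 0.65, 0.10, 0.05])
queen = np.array([0.75, 0.70, 0.15, 0.10])
banana = np.array([0.05, 0.10, 0.90, 0.80])

print(cosine_similarity(king, queen))   # high: semantically related words
print(cosine_similarity(king, banana))  # low: unrelated words
```

The principle is the same at full scale: related words score close to 1, unrelated words score close to 0.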

BlazingText is an implementation of the Word2Vec algorithm, which Google published in 2013, and its models are compatible with Facebook’s FastText. These revision notes are part of subdomain 3.2, Select the appropriate model(s) for a given machine learning problem.

What does the BlazingText algorithm do?

BlazingText is used for text analysis and text classification problems. It offers both unsupervised and supervised learning: Word2Vec is unsupervised, while text classification is supervised. BlazingText has two modes:

  1. Word2Vec
  2. Text Classifier

Usually, for text classification, you would pre-process the data, pass it through a Word2Vec algorithm, and then through a text classifier. BlazingText implements Word2Vec and the text classifier as a single process.

How is BlazingText implemented?

BlazingText processes text data. The input data is presented in a single file with one sentence per line.

What are the training data formats for BlazingText?

There are two input file formats:

  1. File Mode
  2. Augmented Manifest Text (AMT) format

The data in File Mode is text with space-separated words and one sentence per line. Each line begins with a label like this: __label__1. The data in Augmented Manifest Text format is JSON Lines: each line is a JSON object containing a source sentence and one or more labels, with multiple labels supplied as a JSON array. Here are some examples:

A single line in File Mode:

__label__1 Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring.

A single JSON line in Augmented Manifest Text format:

{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":1}

A single JSON line with multiple labels, supplied as a JSON array, in Augmented Manifest Text format:

{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":[1,3]}
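Here is a minimal sketch of writing one training example in each format; the file names, sentence, and label values are illustrative only:

```python
# A sketch of preparing BlazingText training data in both supported formats.
# File names, the sentence, and the label values are illustrative.
import json

sentence = "our aim is to increase the year-round consumption of berries in the uk"

# File Mode: space-separated words, one sentence per line, __label__ prefix
with open("train.txt", "w") as f:
    f.write(f"__label__1 {sentence}\n")

# Augmented Manifest Text format: one JSON object per line (JSON Lines)
with open("train.manifest", "w") as f:
    f.write(json.dumps({"source": sentence, "label": 1}) + "\n")
    # multiple labels can be supplied as a JSON array
    f.write(json.dumps({"source": sentence, "label": [1, 3]}) + "\n")
```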

Model artifacts and inference

| Description          | Word2Vec                                      | Text classification |
|----------------------|-----------------------------------------------|---------------------|
| Learning paradigm    | Unsupervised                                  | Supervised          |
| Model binaries       | vectors.bin                                   | model.bin           |
| Supporting artifacts | vectors.txt, eval.json (optional)             | –                   |
| Request format       | JSON                                          | JSON                |
| Result               | List of vectors; zeros if a word is not found | One prediction      |
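To make the request and result rows concrete, here is a hedged sketch of invoking deployed endpoints with boto3; the endpoint names are placeholders, and the payload shapes follow the JSON request format in the table above:

```python
# A sketch of querying deployed BlazingText endpoints. Endpoint names are
# placeholder values; replace them with your own deployed endpoints.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Word2Vec endpoint: send words, receive one vector per word
# (a word not found in the vocabulary comes back as a vector of zeros)
payload = {"instances": ["berries", "growers"]}
response = runtime.invoke_endpoint(
    EndpointName="blazingtext-word2vec",  # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))

# Text classification endpoint: send sentences, receive one prediction each
payload = {"instances": ["our aim is to increase consumption of berries"]}
response = runtime.invoke_endpoint(
    EndpointName="blazingtext-classifier",  # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```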

Processing environment

BlazingText can be run on a single CPU or GPU instance, or multiple CPU instances. The choice depends on the type of processing being performed. Word2Vec has three processing modes:

  1. Skip-gram
  2. Continuous Bag Of Words (CBOW)
  3. Batch Skip-gram

Skip-gram and CBOW are mirror images of each other: in skip-gram mode you supply a word and the model predicts the context of the word, while with CBOW you provide the context and a predicted word is returned. Batch skip-gram is a variant of skip-gram that can be distributed across multiple CPU instances.

| Instance type                             | Word2Vec: Skip-gram | Word2Vec: CBOW | Word2Vec: Batch skip-gram | Text classification |
|-------------------------------------------|---------------------|----------------|---------------------------|---------------------|
| Single CPU instance                       | X                   | X              | X                         | X                   |
| Single GPU instance (with 1 or more GPUs) | X                   | X              | –                         | X                   |
| Multiple CPU instances                    | –                   | –              | X                         | –                   |

From this table you can see that all types of processing can be performed on a single CPU instance. Only Word2Vec in batch skip-gram mode can run on multiple CPU instances, and this mode cannot utilise GPUs.
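The mode and the instance choice come together at training time as hyperparameters and estimator settings. Here is a hedged sketch using the SageMaker Python SDK; the S3 paths are placeholders, and the instance pairings follow the table above:

```python
# A sketch of configuring BlazingText training with the SageMaker Python SDK.
# S3 paths are placeholders; mode/instance pairings follow the table above.
import sagemaker
from sagemaker import image_uris

session = sagemaker.Session()
container = image_uris.retrieve("blazingtext", session.boto_region_name)

estimator = sagemaker.estimator.Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=2,               # more than 1 only for batch_skipgram
    instance_type="ml.c5.2xlarge",  # CPU instances; ml.p3.* for GPU-capable modes
    sagemaker_session=session,
)

# mode is one of "skipgram", "cbow", "batch_skipgram" (Word2Vec)
# or "supervised" (text classification)
estimator.set_hyperparameters(mode="batch_skipgram", vector_dim=100, epochs=5)

estimator.fit({"train": "s3://my-bucket/blazingtext/train"})  # placeholder path
```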

What are BlazingText’s strengths and weaknesses?

The strength of BlazingText is high performance: AWS reports it is more than 20x faster than other popular alternatives such as Facebook’s FastText. This enables inferences to be served in real time for online transactions rather than in batch. The main weakness of BlazingText is handling words that were not present in the training data, called Out Of Vocabulary (OOV) words. Typically such words are marked as unknown and returned as zero vectors. There are other ways to perform Word2Vec processing, but they do not have the high performance of BlazingText.
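One mitigation worth knowing: BlazingText can learn subword (character n-gram) embeddings, which lets it compose a vector for an unseen word instead of returning zeros. Here is a minimal sketch, assuming a trained Word2Vec model without subwords enabled, of detecting the zero-vector OOV case:

```python
# A sketch of detecting OOV words when subword embeddings are not enabled:
# BlazingText returns an all-zero vector for a word it has never seen.
import numpy as np

def is_oov(vector):
    return not np.any(vector)  # True when every component is zero

# At training time, enabling subword embeddings (see the estimator sketch
# above) lets the model build vectors for unseen words instead:
# estimator.set_hyperparameters(mode="skipgram", subwords=True)
```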

What are the use cases for BlazingText?

BlazingText can only ingest words, so the input data must be text. Word2Vec is required to convert text into vectors for Natural Language Processing. Typical use cases include:

  • Word2Vec, producing embeddings for downstream NLP tasks such as:
    • Sentiment analysis
    • Named entity recognition
    • Machine translation
  • Text classification:
    • Web searches
    • Information retrieval
    • Ranking
    • Document classification

Some examples of BlazingText

Video: AWS re:Invent 2019: Natural language modeling with Amazon SageMaker BlazingText algorithm (AIM375-P)

This is a 50:36 video from AWS by Denis Batalov (LinkedIn profile: https://www.linkedin.com/in/denis-v-batalov-59a3111/). The presentation can be split into four parts, as shown in the timestamps below. I suggest you skip the first two parts and start with the overview of SageMaker BlazingText at 17:13. This is the link to the Jupyter Notebook used in the demo (part 4):

  • 0:00 – Introduction
  • 2:17 – Word embedding
  • 2:56 – Word representations
  • 3:43 – One-hot encoding
  • 4:37 – Intuition: given a sentence, try to maximise the probability of predicting its context words
  • 6:20 – Word2Vec algorithm
  • 8:20 – t-SNE diagram
  • 9:23 – Overview of Amazon SageMaker
  • 12:20 – Build, train and deploy ML models
  • 13:16 – Built-in algorithms
  • 14:10 – Deep learning frameworks
  • 15:17 – Automatic Model Tuning
  • 16:27 – Amazon SageMaker Neo
  • 17:13 – Overview of SageMaker BlazingText
  • 18:28 – BlazingText highlights
  • 18:45 – Optimization on CPU: negative samples sharing
  • 19:40 – Throughput characteristics
  • 20:35 – BlazingText benchmarking
  • 23:00 – Demo: Georgian Wikipedia

Selected articles

This article, by Evan Harris, describes the usefulness of having a website search feature tuned to the specific vocabulary used on the website. The example Evan uses is for a search for a specific grape variety which returns a list of wines that use that variety.

This article, by Roald Schuring, is a good worked example of using BlazingText in Word2Vec mode: Training Word Embeddings On AWS SageMaker Using BlazingText.

This example, from AWS, uses a method to enable BlazingText to generate vectors for out-of-vocabulary (OOV) words. 

This is an example SageMaker notebook on GitHub which uses a dataset derived from Wikipedia.

Video: Amazon SageMaker’s Built-in Algorithm Webinar Series: Blazing Text

This is a 1:14:36 video from AWS by Pratap Ramamurthy (LinkedIn profile: https://www.linkedin.com/in/pratapramamurthy/). This is a very long video, so use the timestamps below to select the part you wish to see.

  • 0:00 – Introduction
  • 2:19 – What are Amazon algorithms?
  • 3:08 – BlazingText algorithms
  • 3:17 – BlazingText use case
  • 4:16 – Typical deep learning tasks on text
  • 5:36 – Integer encoding
  • 9:20 – One-hot encoding
  • 14:00 – Requirements for word vectors
  • 16:32 – Word2Vec mechanism
  • 16:42 – Word2Vec setup
  • 18:07 – Skip-gram preprocessing
  • 20:30 – Neural network setup
  • 25:38 – BlazingText word embedding
  • 27:35 – Word vectors used for further ML training
  • 28:20 – Intuition
  • 28:25 – Random, or is there a pattern? (t-SNE plot)
  • 31:14 – Distance between related words
  • 32:26 – How did the magic work?
  • 35:08 – OOV handling using BlazingText
  • 39:38 – Subword detection
  • 41:43 – Text classification with BlazingText
  • 42:18 – Typical NLP pipeline
  • 44:25 – Parameters
  • 47:43 – Demo
  • 1:00:11 – Questions

Conclusion

BlazingText is a high-performance algorithm for analyzing text. Its two processing modes produce either numeric vectors for Natural Language Processing, via the Word2Vec algorithm that can predict a word from its context (CBOW) or the context from a word (skip-gram), or text classifications.

Credits

Burning book photo by Gaspar Uhas on Unsplash
