BlazingText Algorithm
BlazingText is the name AWS gives to its SageMaker built-in algorithm that learns relationships between words in text documents. These relationships, also called embeddings, are expressed as vectors. The vectors preserve the semantic relationships between words, so words with similar meanings cluster together. This conversion of words into meaningful numeric vectors is useful because Natural Language Processing requires input data in vector format, which is why BlazingText is often used as a precursor to Natural Language Processing tasks.
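As a toy illustration, assuming made-up three-dimensional vectors rather than real BlazingText output, the snippet below uses cosine similarity to show related words sitting closer together than unrelated ones:

```python
# Toy word vectors (made up, 3-dimensional); real embeddings are much larger.
import numpy as np

embeddings = {
    "berry":      np.array([0.9, 0.1, 0.0]),
    "strawberry": np.array([0.8, 0.2, 0.1]),
    "tractor":    np.array([0.1, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["berry"], embeddings["strawberry"]))  # ~0.98 (similar)
print(cosine_similarity(embeddings["berry"], embeddings["tractor"]))     # ~0.20 (dissimilar)
```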
BlazingText is an implementation of the Word2Vec algorithm, which Google published in 2013, and the models it produces are compatible with Facebook's fastText. These revision notes are part of subdomain 3.2, Select the appropriate model(s) for a given machine learning problem.
- AWS docs: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html
- https://en.wikipedia.org/wiki/Word2vec
- Original Word2Vec paper from Google (2013), Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/abs/1301.3781
- Follow-up paper from Google (2013), Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/abs/1310.4546
What does the BlazingText algorithm do
BlazingText is used for textual analysis and text classification problems. It offers both unsupervised and supervised learning: Word2Vec is unsupervised and text classification is supervised. BlazingText therefore has two modes:
- Word2Vec
- Text Classifier
Usually for text classification you would pre-process the data, pass it through a Word2Vec algorithm and then through a separate text classifier. The BlazingText algorithm implements Word2Vec and the text classifier as a single process, as the sketch below shows.
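Below is a minimal sketch, assuming the SageMaker Python SDK v2, of launching a BlazingText training job in supervised (text classification) mode. The IAM role, S3 paths and hyperparameter values are placeholders, not recommendations:

```python
# Minimal sketch: launch a BlazingText training job in supervised mode
# with the SageMaker Python SDK (v2). Role ARN and S3 paths are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
container = image_uris.retrieve("blazingtext", region)  # built-in algorithm image

bt = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.4xlarge",
    output_path="s3://my-bucket/blazingtext/output",      # placeholder bucket
    sagemaker_session=session,
)

# mode can be "skipgram", "cbow" or "batch_skipgram" (Word2Vec), or "supervised"
bt.set_hyperparameters(mode="supervised", epochs=10, min_count=2, vector_dim=100)

bt.fit({
    "train": TrainingInput("s3://my-bucket/blazingtext/train", content_type="text/plain"),
    "validation": TrainingInput("s3://my-bucket/blazingtext/validation", content_type="text/plain"),
})
```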
How is BlazingText implemented
BlazingText processes text data. The input data is presented in a single file with one sentence per line.
What are the training data formats for BlazingText
There are two input file formats:
- File Mode
- Augmented Manifest Text (AMT) format
The data in File Mode is text with space-separated words and one sentence per line. For the supervised text classification mode, each line begins with a label prefix such as __label__1; Word2Vec mode needs no labels. The data in Augmented Manifest Text format is JSON (JSON Lines): each line is a JSON object containing the source sentence and its label, and multiple labels can be supplied as a JSON array. Here are some examples, followed by a short script that writes both formats:
A single line in File Mode:
__label__1 Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring.
A single JSON line in Augmented Manifest Text format:
{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":1}
A single JSON line with multiple labels, supplied as a JSON array, in Augmented Manifest Text format:
{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":[1,3]}
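As a small hedged sketch (the sentence and labels are illustrative only), the script below writes the same training example in both formats:

```python
# Write one training example in both BlazingText input formats.
# The sentence and label values are illustrative only.
import json

tokens = ("our aim is to increase the year-round consumption of berries "
          "in the uk").split()
label = 1

# File Mode: space-separated tokens, one sentence per line,
# prefixed with __label__<value> for the supervised mode.
with open("train.txt", "w") as f:
    f.write(f"__label__{label} " + " ".join(tokens) + "\n")

# Augmented Manifest Text format (JSON Lines): one JSON object per line.
with open("train.manifest", "w") as f:
    f.write(json.dumps({"source": " ".join(tokens), "label": label}) + "\n")
    # Multiple labels can be supplied as a JSON array:
    f.write(json.dumps({"source": " ".join(tokens), "label": [1, 3]}) + "\n")
```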
- AMT: https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html
- Json lines format: http://jsonlines.org/
- Text examples from https://www.britishsummerfruits.co.uk/about
Model artifacts and inference
| Description | Word2Vec | Text classification |
| --- | --- | --- |
| Learning paradigm | Unsupervised | Supervised |
| Model binaries | vectors.bin | model.bin |
| Supporting artifacts | vectors.txt, eval.json (optional) | None |
| Request format | JSON | JSON |
| Result | List of vectors (zeros if a word is not found) | One prediction |
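As a hedged sketch, the snippet below shows how an inference request could be sent to deployed endpoints for each mode using boto3. The endpoint names are placeholders; the request body follows the JSON format noted in the table above:

```python
# Sketch of invoking BlazingText endpoints with boto3. Endpoint names are
# placeholders for endpoints you would have deployed yourself.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Word2Vec endpoint: send words, receive one vector per word
# (a vector of zeros is returned for words not seen during training).
word2vec_payload = {"instances": ["berries", "growers"]}
response = runtime.invoke_endpoint(
    EndpointName="blazingtext-word2vec-endpoint",    # placeholder name
    ContentType="application/json",
    Body=json.dumps(word2vec_payload),
)
print(json.loads(response["Body"].read()))

# Text classification endpoint: send sentences, receive a predicted label.
classifier_payload = {"instances": ["we work closely with british growers"]}
response = runtime.invoke_endpoint(
    EndpointName="blazingtext-classifier-endpoint",  # placeholder name
    ContentType="application/json",
    Body=json.dumps(classifier_payload),
)
print(json.loads(response["Body"].read()))
```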
Processing environment
BlazingText can be run on a single CPU or GPU instance, or multiple CPU instances. The choice depends on the type of processing being performed. Word2Vec has three processing modes:
- Skip-gram
- Continuous Bag Of Words (CBOW)
- Batch Skip-gram
Skip-gram and CBOW are mirror images of each other: in skip-gram mode the model takes a word and learns to predict the surrounding context, while in CBOW mode the model takes the context and learns to predict the missing word. Batch skip-gram is a variant of skip-gram that can be distributed across multiple CPU instances.
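The toy snippet below, plain Python rather than BlazingText itself, illustrates how the two framings turn the same sentence into training examples:

```python
# Toy illustration of skip-gram vs CBOW training pairs (window size 1).
sentence = "the quick brown fox jumps".split()
window = 1

skip_gram_pairs = []  # (centre word -> one context word)
cbow_pairs = []       # (context words -> centre word)
for i, centre in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skip_gram_pairs += [(centre, c) for c in context]
    cbow_pairs.append((context, centre))

print(skip_gram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[1])        # (['the', 'brown'], 'quick')
```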
| Instance type | Word2Vec: Skip-gram | Word2Vec: CBOW | Word2Vec: Batch skip-gram | Text classification |
| --- | --- | --- | --- | --- |
| Single CPU instance | X | X | X | X |
| Single GPU instance (with 1 or more GPUs) | X | X | | X |
| Multiple CPU instances | | | X | |
From this table you can see that all types of processing can be performed on a single CPU instance. Only Word2Vec in batch skip-gram mode can run on multiple CPU instances, and this mode cannot utilise GPUs.
What are BlazingText’s strengths and weaknesses
The strength of BlazingText is high performance: AWS reports it can be more than 20x faster than popular alternatives such as Facebook's fastText. This speed makes it practical to serve inferences in real time for online transactions rather than relying on batch processing. The main weakness of BlazingText is handling words that were not present in the training data, known as Out Of Vocabulary (OOV) words; by default the Word2Vec mode returns a vector of zeros for such words, although subword embeddings can mitigate this, as sketched below. There are other ways to perform Word2Vec processing, but they do not match BlazingText's performance.
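As a hedged sketch based on the AWS hyperparameter reference, the Word2Vec modes accept a subwords flag; enabling it learns character n-gram embeddings so that a vector can be composed for an OOV word at inference time. The values below are illustrative and would be passed to an Estimator's set_hyperparameters() call like the one shown earlier:

```python
# Illustrative Word2Vec hyperparameters; subwords=True enables character
# n-gram embeddings, which allow vectors to be built for OOV words.
word2vec_hyperparameters = {
    "mode": "skipgram",
    "vector_dim": 100,
    "window_size": 5,
    "subwords": True,
}
```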
What is the Use Case for BlazingText
BlazingText can only ingest words, so the input data must be text. Word2Vec is required to convert data to vectors for Natural Language Processing.
- Word2Vec, for tasks such as:
  - Sentiment analysis
  - Named entity recognition
  - Machine translation
- Text classification:
  - Web searches
  - Information retrieval
  - Ranking
  - Document classification
Some examples of BlazingText
Video: AWS re:Invent 2019: Natural language modeling with Amazon SageMaker BlazingText algorithm (AIM375-P)
This is a 50-minute video from AWS by Denis Batalov (LinkedIn profile: https://www.linkedin.com/in/denis-v-batalov-59a3111/). The presentation can be split into four parts, as shown in the timestamps below. I suggest you skip the first two parts and start with the overview of SageMaker BlazingText at 17.13. This is the link to the Jupyter Notebook used in the demo (part 4):
- SageMaker notebook on Github: https://github.com/dbatalov/wikipedia-embedding
- 0 – Introduction
- 2.17 – Word embedding
- 2.56 – Word representations
- 3.43 – One hot encoding
- 4.37 – Intuition: given a sentence, try to maximise the probability of predicting the context of each word.
- 6.20 – Word2Vec algorithm
- 8.20 – t-SNE diagram
- 9.23 – Overview of Amazon SageMaker
- 12.20 – Build, train and deploy ML Models
- 13.16 – Built-in algorithms
- 14.10 – Deep learning frameworks
- 15.17 – Automatic Model Tuning
- 16.27 – Amazon SageMaker Neo
- 17.13 – Overview of SageMaker BlazingText
- 18.28 – BlazingText highlights
- 18.45 – Optimization on CPU negative samples sharing
- 19.40 – Throughput characteristics
- 20.35 – BlazingText benchmarking
- 23.00 – Demo – Georgian Wikipedia
Selected articles
This article, by Evan Harris, describes the usefulness of having a website search feature tuned to the specific vocabulary used on the website. The example Evan uses is a search for a specific grape variety, which returns a list of wines made with that variety.
This article, by Roald Schuring, is a good worked example of using BlazingText in Word2Vec mode: Training Word Embeddings On AWS SageMaker Using BlazingText.
This example, from AWS, uses a method to enable BlazingText to generate vectors for out-of-vocabulary (OOV) words.
This is an example SageMaker notebook on GitHub which uses a dataset derived from Wikipedia.
Video: Amazon SageMaker’s Built-in Algorithm Webinar Series: Blazing Text
This is a video of 1 hour 14 minutes from AWS by Pratap Ramamurthy (LinkedIn profile: https://www.linkedin.com/in/pratapramamurthy/). It is a very long video, so use the timestamps below to select the part you wish to see.
- 0 – Introduction
- 2.19 – What are Amazon algorithms
- 3.08 – BlazingText algorithms
- 3.17 – BlazingText use case
- 4.16 – Typical deep learning task on Text
- 5.36 – Integer encoding
- 9.20 – One hot encoding
- 14.00 – Requirements for word vectors
- 16.32 – Word2Vec mechanism
- 16.42 – Word2Vec setup
- 18.07 – Skip-gram preprocessing
- 20.30 – Neural network setup
- 25.38 – BlazingText word embedding
- 27.35 – Word vectors used for further ML training
- 28.20 – Intuition
- 28.25 – Random or is there a pattern? (t-SNE plot)
- 31.14 – Distance between related words
- 32.26 – How did the magic work?
- 35.08 – OOV handling using BlazingText
- 39.38 – Subword detection
- 41.43 – Text classification with BlazingText
- 42.18 – Typical NLP pipeline
- 44.25 – Parameters
- 47.43 – Demo
- 1.00.11 – Questions
Conclusion
BlazingText is a high performance algorithm for analyzing text. Its two processing modes produce either numeric vectors for Natural Language Processing, via the Word2Vec algorithm that predicts a word from its context or the context from a word, or text classifications that assign labels to input sentences.
Credits
Burning book photo by Gaspar Uhas on Unsplash