BlazingText Algorithm
BlazingText is the name AWS gives to its SageMaker built-in algorithm that identifies relationships between words in text documents. These relationships, also called embeddings, are expressed as vectors. The vectors preserve the semantic relationships between words by clustering words with similar meanings together. This conversion of words to meaningful numeric vectors is very useful for Natural Language Processing (NLP), which requires input data in vector format, and it is why BlazingText is often used as a precursor to NLP tasks.
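To make this concrete, here is a minimal sketch, using made-up 3-dimensional vectors (real embeddings have many more dimensions, set by the vector_dim hyperparameter), of how cosine similarity expresses the clustering of semantically similar words:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.7, 0.4, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words score higher than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```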
BlazingText is an implementation of the Word2Vec algorithm. Word2Vec was published by Google in 2013 and is compatible with Facebook’s FastText.
This article contains revision notes for the AWS certified exam MLS-C01, Machine Learning – Specialty.
This article was published on Medium.com on 5th July 2023.
What does the BlazingText algorithm do
BlazingText is used for textual analysis and text classification problems. It is the only SageMaker built-in algorithm to have both unsupervised and supervised learning modes: Word2Vec is unsupervised learning and text classification is supervised learning.
- Word2Vec – Unsupervised learning
- Text Classifier – Supervised learning
Usually, for text classification, you would pre-process the data by passing it through a Word2Vec algorithm and then a text classifier. The BlazingText algorithm implements Word2Vec and the text classifier as a single process.
How is BlazingText implemented
BlazingText is a SageMaker built-in algorithm, so it can be trained via SageMaker Jupyter notebooks and deployed on SageMaker endpoints. BlazingText processes text data. The input data is presented in a single file with one sentence per line. A minimal training sketch is shown below.
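As an illustration, here is a minimal training sketch using the SageMaker Python SDK (v2). The IAM role ARN, S3 path and hyperparameter values are placeholders, not recommendations:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Retrieve the BlazingText container image for the current region.
container = image_uris.retrieve("blazingtext", session.boto_region_name)

bt = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)
bt.set_hyperparameters(mode="skipgram", vector_dim=100, epochs=5)

# Train on a single text file in S3 with one sentence per line.
bt.fit({"train": "s3://my-bucket/blazingtext/train.txt"})  # placeholder path
```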
What are the training data formats for BlazingText
There are two input file formats:
- File Mode
- Augmented Manifest Text (AMT) format
The data in File Mode is text with space-separated words and one sentence per line. For text classification, each line begins with a label prefix such as __label__1. The data in Augmented Manifest Text format uses JSON Lines: each line is a JSON object containing a source sentence and one or more labels, with multiple labels expressed as a JSON array. Here are some examples, followed by a sketch that generates both formats:
A single line in File Mode:
__label__1 Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring.
A single JSON line in Augmented Manifest Text format:
{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":1}
A single JSON line with multiple labels expressed as a JSON array, in Augmented Manifest Text format:
{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":[1,3]}
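For illustration, this sketch writes the same records in both formats; the file names, sentences and label values are arbitrary:

```python
import json

sentences = [
    ("our aim is to increase the year-round consumption of berries in the uk", 1),
    ("working closely with british growers during the spring and summer months", 3),
]

# File Mode: a "__label__<id>" prefix, then space-separated words, one sentence per line.
with open("train.txt", "w") as f:
    for text, label in sentences:
        f.write(f"__label__{label} {text}\n")

# Augmented Manifest Text format: one JSON object per line (JSON Lines).
with open("train.manifest", "w") as f:
    for text, label in sentences:
        f.write(json.dumps({"source": text, "label": label}) + "\n")
```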
Model artifacts and inference
BlazingText uses different artifacts depending on its processing mode. This table summarises the file names and formats.
| Description | Word2Vec | Text classification |
| --- | --- | --- |
| Model binaries | vectors.bin | model.bin |
| Supporting artifacts | vectors.txt, eval.json (optional) | none |
| Request format | JSON | JSON |
| Result | List of vectors (zeros if a word is not found) | One prediction |
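As an illustration of the JSON request format, here is a sketch of invoking a deployed Word2Vec endpoint with boto3; the endpoint name is a placeholder. In text classification mode the instances would be sentences, optionally with a configuration key such as {"k": 2} to request the top two predictions:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Word2Vec mode: request the vectors for a list of words.
payload = {"instances": ["berries", "growers"]}

response = runtime.invoke_endpoint(
    EndpointName="blazingtext-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Each result holds a word and its embedding; words not seen in training return zeros.
vectors = json.loads(response["Body"].read())
print(vectors)
```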
Processing environment
BlazingText can be run on a single CPU or GPU instance, or multiple CPU instances. The choice depends on the type of processing being performed. Word2Vec has three processing methods:
- Skip-gram
- Continuous Bag Of Words (CBOW)
- Batch Skip-gram
Skip-gram and CBOW are the inverse of each other: in skip-gram mode you supply a word and the model predicts the context of the word, while in CBOW mode you supply the context and the model predicts the word. Batch skip-gram is a variant of skip-gram that can be distributed across multiple CPU instances.
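The toy sketch below illustrates the training pairs each method would derive from a sentence, assuming a context window of one word on either side of the target:

```python
sentence = "berries grow in summer".split()
window = 1  # number of context words taken on each side of the target

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # Skip-gram: predict each context word from the target word.
    skipgram_pairs = [(target, c) for c in context]
    # CBOW: predict the target word from the combined context.
    cbow_pair = (context, target)
    print(skipgram_pairs, cbow_pair)
```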
| Processing method | Skip-gram | CBOW | Batch skip-gram | Text classification |
| --- | --- | --- | --- | --- |
| Mode | Word2Vec | Word2Vec | Word2Vec | Text classification |
| Single CPU instance | X | X | X | X |
| Single GPU instance (with 1 or more GPUs) | X | X | | X (1 GPU only) |
| Multiple CPU instances | | | X | |
From this table you can see that all processing methods can be performed on a single CPU instance. Only Word2Vec using the batch skip-gram method can run on multiple CPU instances, and this method cannot utilise GPUs. A distributed-training sketch follows.
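Here is a minimal sketch of distributed training with batch skip-gram, mirroring the estimator example above; the role ARN and S3 path are again placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("blazingtext", session.boto_region_name)

bt_distributed = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=2,  # batch_skipgram is the only mode that scales across instances
    instance_type="ml.c5.2xlarge",  # CPU instances: batch_skipgram cannot use GPUs
    sagemaker_session=session,
)
bt_distributed.set_hyperparameters(mode="batch_skipgram", vector_dim=100)
bt_distributed.fit({"train": "s3://my-bucket/blazingtext/train.txt"})  # placeholder path
```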
What are BlazingText’s strengths and weaknesses
The strength of BlazingText is high performance: it is more than 20x faster than other popular alternatives such as Facebook's FastText. This makes real-time inference for online transactions feasible, rather than batch processing. The main weakness of BlazingText is handling words that were not present in the training data, known as Out Of Vocabulary (OOV) words. Typically such words are marked as unknown, although training with subword embeddings can mitigate this, as the sketch below shows. There are other ways to perform Word2Vec processing, but they do not have the high performance of BlazingText.
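One documented mitigation is to train with subword embeddings, which let the model compose a vector for an unseen word from its character n-grams. A brief sketch, reusing the bt estimator from the training example above; the n-gram lengths shown are the documented defaults:

```python
# Enable FastText-style subword (character n-gram) embeddings so the trained
# model can generate vectors for out-of-vocabulary words at inference time.
bt.set_hyperparameters(
    mode="skipgram",
    vector_dim=100,
    subwords=True,  # learn character n-gram vectors alongside word vectors
    min_char=3,     # smallest character n-gram length (default)
    max_char=6,     # largest character n-gram length (default)
)
```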
What is the Use Case for BlazingText
BlazingText can only ingest words, so the input data must be text. Word2Vec is used to convert text into the vector form required for Natural Language Processing.
- Word2Vec:
- Sentiment analysis
- Named entity recognition
- Machine translation
- Text classification:
- Web searches
- Information retrieval
- Ranking
- Document classification
Video: AWS re:Invent 2019: Natural language modeling with Amazon SageMaker BlazingText algorithm (AIM375-P)
This is a 50:36 video from AWS by Denis Batalov (LinkedIn profile: https://www.linkedin.com/in/denis-v-batalov-59a3111/). The presentation can be split into four parts, as shown in the timestamps below. I suggest you skip the first two parts and start with the overview of SageMaker BlazingText at 17:13. This is the link to the Jupyter notebook used in the demo (part 4):
- SageMaker notebook on Github: https://github.com/dbatalov/wikipedia-embedding
- 0:00 – Introduction
- 2:17 – Word embedding
- 2:56 – Word representations
- 3:43 – One hot encoding
- 4:37 – Intuition: given a sentence, try to maximise the probability of predicting the context of words
- 6:20 – Word2Vec algorithm
- 8:20 – t-SNE diagram
- 9:23 – Overview of Amazon SageMaker
- 12:20 – Build, train and deploy ML models
- 13:16 – Built-in algorithms
- 14:10 – Deep learning frameworks
- 15:17 – Automatic Model Tuning
- 16:27 – Amazon SageMaker Neo
- 17:13 – Overview of SageMaker BlazingText
- 18:28 – BlazingText highlights
- 18:45 – Optimization on CPU: negative samples sharing
- 19:40 – Throughput characteristics
- 20:35 – BlazingText benchmarking
- 23:00 – Demo – Georgian Wikipedia
Selected articles with examples of BlazingText being used
This article, by Evan Harris, describes the usefulness of having a website search feature tuned to the specific vocabulary used on the website. The example Evan uses is for a search for a specific grape variety which returns a list of wines that use that variety.
This article is a good worked example of using BlazingText in Word2Vec mode: Training Word Embeddings On AWS SageMaker Using BlazingText by Roald Schuring.
This example, from AWS, uses a method to enable BlazingText to generate vectors for out-of-vocabulary (OOV) words.
This is an example SageMaker notebook on Github which uses a dataset derived from Wikipedia.
Video: Amazon SageMaker’s Built-in Algorithm Webinar Series: Blazing Text
This is a 1:14:36 video from AWS by Pratap Ramamurthy (LinkedIn profile: https://www.linkedin.com/in/pratapramamurthy/). This is a very long video, so use the timestamps below to select the part you wish to see.
- 0:00 – Introduction
- 2:19 – What are Amazon algorithms
- 3:08 – BlazingText algorithms
- 3:17 – BlazingText use case
- 4:16 – Typical deep learning task on text
- 5:36 – Integer encoding
- 9:20 – One hot encoding
- 14:00 – Requirements for word vectors
- 16:32 – Word2Vec mechanism
- 16:42 – Word2Vec setup
- 18:07 – Skip-gram preprocessing
- 20:30 – Neural network setup
- 25:38 – BlazingText word embedding
- 27:35 – Word vectors used for further ML training
- 28:20 – Intuition
- 28:25 – Random or is there a pattern? (t-SNE plot)
- 31:14 – Distance between related words
- 32:26 – How did the magic work?
- 35:08 – OOV handling using BlazingText
- 39:38 – Subword detection
- 41:43 – Text classification with BlazingText
- 42:18 – Typical NLP pipeline
- 44:25 – Parameters
- 47:43 – Demo
- 1:00:11 – Questions
Conclusion
BlazingText is a high-performance algorithm for analyzing text. Its two processing modes produce either numeric vectors for Natural Language Processing, via the Word2Vec algorithm, which can infer words from context or context from words, or text classifications that assign labels to sentences.
Resources
These revision notes support subdomain 3.2 Select the appropriate model(s) for a given machine learning problem of the AWS certification exam: AWS Certified Machine Learning – Specialty (MLS-C01).
3.2 Select the appropriate model(s) for a given machine learning problem.
AWS Certified Machine Learning – Specialty, (MLS-C01) Exam Guide
XGBoost, logistic regression, K-means, linear regression, decision trees, random forests, RNN, CNN, Ensemble, Transfer learning
Express intuition behind models
- AWS Certified Machine Learning exam guide: www.mlexam.com/aws-machine-learning-exam-guide
- 3 Modeling: www.mlexam.com/home/domain-3-modeling/
- 3.2 Text processing algorithms: www.mlexam.com/sagemaker-text-processing-algorithms
- Questions for SageMaker built-in algorithms and their uses: www.mlexam.com/35-q-a-for-sagemaker-built-in-algorithms
- Free Practice exam: www.mlexam.com/aws-machine-learning-practice-exam
Overview
- AWS docs: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html
- Wikipedia Word2vec: https://en.wikipedia.org/wiki/Word2vec
- Google original 2013 paper, "Efficient Estimation of Word Representations in Vector Space": https://arxiv.org/abs/1301.3781
- Google original 2013 paper, "Distributed Representations of Words and Phrases and their Compositionality": https://arxiv.org/abs/1310.4546
Training data format resources
- Augmented Manifest Text (AMT) format: https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html
- JSON Lines format: http://jsonlines.org/
- Text examples from https://www.britishsummerfruits.co.uk/about
Credits
Burning book photo by Gaspar Uhas on Unsplash
Contains affiliate links. If you go to Whizlabs' website and make a purchase, I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.
Whizlabs' AWS Certified Machine Learning Specialty practice exams
Practice Exams with 271 questions, Video Lectures and Hands-on Labs from Whizlabs
Whizlabs' AWS Certified Machine Learning Specialty practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined in the official documentation. These practice tests help candidates gain confidence in their exam preparation and self-evaluate against the exam content.
Practice test content
- Free Practice test – 15 questions
- Practice test 1 – 65 questions
- Practice test 2 – 65 questions
- Practice test 3 – 65 questions