Latent Dirichlet Allocation Algorithm
The SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that groups the words in a set of documents into topics. Each topic is a probability distribution over words, and each document is described as a mixture of topics. LDA can be used to discover topics shared by documents within a text corpus. The number of topics is specified by the user.
LDA is a bag-of-words algorithm, so word order does not matter. It models the data in terms of latent variables rather than trying to map inputs to outputs. LDA can make it easier to search large electronic archives: with theme- or topic-based searching, a user can find documents that do not contain the precise words they searched for. As new documents are added, they can be processed by LDA to identify their topics and to add new topics to the search feature. LDA was first published in 2002.
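The bag-of-words idea can be illustrated with a minimal sketch. It uses only the Python standard library; the helper name `bag_of_words` and the two sample sentences are made up for illustration and are not part of SageMaker.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order and case."""
    return Counter(text.lower().split())


doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

# The two sentences differ in word order (and in meaning), but their
# bags of words are identical, so a bag-of-words model such as LDA
# cannot tell them apart.
same = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same)
```

Whitespace splitting is the simplest possible tokenizer; a real pipeline would typically also strip punctuation and remove stop words before counting.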
- AWS: https://docs.aws.amazon.com/sagemaker/latest/dg/lda.html
- Beginner's guide to LDA: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
- Wikipedia, Topic model: https://en.wikipedia.org/wiki/Topic_model
Attributes
| Problem attribute | Description |
| --- | --- |
| Data types and format | Text |
| Learning paradigm or domain | Textual analysis, unsupervised learning |
| Problem type | Topic modeling |
| Use case examples | Organize a set of documents into topics that are not known in advance |
Training
Input data can be in recordIO-wrapped-protobuf or CSV format.
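As a rough sketch of what CSV training input could look like, the code below turns raw documents into a dense matrix of word counts, with one row per document and one column per vocabulary word. That layout is an assumption based on the AWS documentation linked above, so check it against that page before use. The function name `to_dense_csv` and the sample documents are hypothetical.

```python
import csv
import io
from collections import Counter


def to_dense_csv(documents):
    """Build a vocabulary and a dense word-count CSV (no header row)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sort the vocabulary so that column order is stable across runs.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for tokens in tokenized:
        counts = Counter(tokens)
        # Words absent from a document get an explicit 0 (dense format).
        writer.writerow([counts[word] for word in vocabulary])
    return vocabulary, buffer.getvalue()


docs = ["apples and oranges", "oranges and lemons and limes"]
vocabulary, csv_text = to_dense_csv(docs)
print(vocabulary)
print(csv_text)
```

The number of columns equals the vocabulary size, which must stay the same between training and inference so that each column always refers to the same word.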
Model artifacts and inference
| Attribute | Value |
| --- | --- |
| Learning paradigm | Unsupervised learning |
| Request format | CSV, JSON, recordIO-protobuf |
| Result | JSON, recordIO-protobuf |
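A JSON result can be post-processed with a few lines of Python. The sketch below assumes the response contains a `predictions` list in which each entry has a `topic_mixture` array of per-topic weights. That shape is an assumption and should be verified against the AWS documentation. The response body shown is invented for illustration, and `dominant_topics` is a hypothetical helper.

```python
import json

# Hypothetical response body for two documents and three topics.
# The "predictions" / "topic_mixture" keys are an assumed format.
response_body = (
    '{"predictions": ['
    '{"topic_mixture": [0.7, 0.2, 0.1]}, '
    '{"topic_mixture": [0.1, 0.1, 0.8]}]}'
)


def dominant_topics(body):
    """Return the index of the highest-weighted topic for each document."""
    predictions = json.loads(body)["predictions"]
    result = []
    for prediction in predictions:
        mixture = prediction["topic_mixture"]
        # Index of the largest weight in the mixture.
        result.append(max(range(len(mixture)), key=mixture.__getitem__))
    return result


print(dominant_topics(response_body))
```

Topic indices carry no meaning on their own; interpreting a topic still requires inspecting which words have the highest weight in it.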
Processing environment
Only CPU instances can be used. Recommended configuration:
- Training: CPU (single instance)
- Inference: CPU
Amazon SageMaker’s Built-in Algorithm Webinar Series: Latent Dirichlet Allocation (LDA)
This is a 57-minute webinar from AWS.