A photograph of a curving library bookshelf, symbolizing the SageMaker text-processing algorithm LDA

Latent Dirichlet Allocation Algorithm

The SageMaker Latent Dirichlet Allocation (LDA) algorithm is an Unsupervised Learning algorithm that describes each document as a mixture of topics. Each topic is a probability distribution over the words in the corpus, learned from how words occur together in documents. LDA can be used to discover topics shared by documents within a text corpus. The number of topics is specified by the user.
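As a rough illustration of how topics and word probabilities relate, the sketch below generates a toy document from two made-up topics. It is not the SageMaker implementation: the topic contents, the `alpha` values, and the helper names `sample_dirichlet` and `generate_document` are all assumptions chosen for the example. LDA training works in the opposite direction, inferring topics like these from observed documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document: pick a topic for each word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(length):
        k = rng.choices(topic_ids, weights=theta)[0]
        vocab = list(topics[k].keys())
        probs = list(topics[k].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics, each a probability distribution over words.
TOPICS = [
    {"goal": 0.5, "team": 0.3, "score": 0.2},
    {"vote": 0.5, "party": 0.3, "law": 0.2},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", theta)
print("document:", doc)
```

The number of entries in `TOPICS` plays the role of the user-specified number of topics.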

LDA is a bag-of-words algorithm, so word order does not matter. It is a generative model: it attempts to model the distribution of inputs and outputs based on latent variables, rather than trying to learn how inputs map to outputs. LDA can make large electronic archives easier to search. With theme-based or topic-based search, a user can find documents that do not contain the precise words they searched for. As new documents are added, they can be processed by LDA to identify their topics and to add new topics to the search feature. LDA was first published in 2002.
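The bag-of-words idea can be shown with a minimal sketch: two sentences with the same words in a different order reduce to the same word counts. The `bag_of_words` helper and its whitespace tokenization are simplifying assumptions for illustration, not part of SageMaker.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())

a = bag_of_words("the cat chased the dog")
b = bag_of_words("the dog chased the cat")

# Different word order, identical bags of words.
print(a == b)
```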


Problem attribute | Description
Data types and format | Text
Learning paradigm or domain | Textual analysis, Unsupervised Learning
Problem type | Topic modeling
Use case examples | Organize a set of documents into topics (not known in advance)


Input data can be in recordIO-wrapped-protobuf or CSV format.
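A minimal sketch of preparing CSV input follows. It assumes that each row is one document and each column is the count of one vocabulary word, which should be checked against the AWS documentation for LDA. The helper names and the whitespace tokenization are hypothetical.

```python
import csv
import io
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({w for doc in documents for w in doc.lower().split()})

def to_count_rows(documents, vocabulary):
    """Turn each document into a dense row of word counts, one column per vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(w, 0) for w in vocabulary])
    return rows

def to_csv(rows):
    """Serialize the count rows as CSV text with no header row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

docs = ["the team scored a goal", "the party won the vote"]
vocab = build_vocabulary(docs)
rows = to_count_rows(docs, vocab)
csv_text = to_csv(rows)
print(vocab)
print(csv_text)
```

The same vocabulary and column order would have to be reused for any later documents so that each column keeps referring to the same word.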

Model artifacts and inference

Learning paradigm | Unsupervised Learning
Request format | CSV

Processing environment

Only CPU instances can be used. Recommended configuration:

  • Training: CPU (single instance)
  • Inference: CPU

Amazon SageMaker’s Built-in Algorithm Webinar Series: Latent Dirichlet Allocation (LDA)

This is a 57-minute webinar from AWS.


Books on curving shelf photo by Susan Yin on Unsplash
