
Streaming data for Machine Learning

Streaming data processing is used when data is continuously generated and needs to be processed as it arrives. The AWS service for streaming data processing is Kinesis. Kinesis comprises four services, each with different capabilities, some of which can be used together. As well as Kinesis, there is another AWS service that can be used with streaming data: Amazon MSK, which is Amazon Managed Streaming for Apache Kafka.

The other form of data ingestion is batch processing, which is discussed in: Batch processing for Machine Learning

Identify and implement a data-ingestion solution is sub-domain 1.2 of the Data Engineering knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus





Real-time data streams can be ingested by Amazon Kinesis, which can process and analyse them as they arrive. Kinesis has four services, each with specific capabilities, that can be used separately or together.

[Infographic: the four Kinesis services and their key attributes]

Kinesis Data Streams

Kinesis Data Streams can be configured to ingest large quantities of data and store it for up to one week. Data is loaded into Kinesis Data Streams by data producers, which produce streaming data as JSON objects or data blobs. There are many kinds of data producer, for example:

  • IoT devices
  • user interaction data from a website or video games
  • manufacturing devices

Data producers can use the Kinesis Producer Library (KPL), a Java library, to write to Kinesis. An alternative is the Kinesis API, a lower-level way of interacting with Kinesis that lacks many of the enhanced features of the KPL. However, the Kinesis API can be used from any AWS SDK, not just Java. For a comparison see Kinesis KPL vs API
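
As a sketch of the lower-level API route, a producer can write a record with boto3, the AWS SDK for Python. The stream name and payload below are hypothetical, and the call assumes AWS credentials are already configured:

```python
import json


def build_record(payload: dict, partition_key: str) -> dict:
    """Serialise a payload as the JSON data blob Kinesis expects."""
    return {
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,  # decides which shard receives the record
    }


def send_record(stream_name: str, payload: dict, partition_key: str) -> dict:
    """Write one record to a Kinesis Data Stream via the low-level API."""
    import boto3  # AWS SDK for Python; needs configured credentials

    kinesis = boto3.client("kinesis")
    return kinesis.put_record(
        StreamName=stream_name, **build_record(payload, partition_key)
    )
```

A call such as `send_record("example-stream", {"device_id": "sensor-1", "temp": 21.5}, "sensor-1")` would then put a single record onto the hypothetical stream `example-stream`.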

Kinesis Data Streams is the way this data is ingested into AWS. The data is stored in shards, from where it can be sent on to data consumers for analysis or storage. Kinesis Data Streams cannot send data directly to a data repository; it has to go through a data consumer first. Examples of data consumers are:

  • Lambda
  • EC2
  • Kinesis Data Analytics
  • EMR

Data consumers use the Kinesis Client Library (KCL), a Java library, to consume data from Kinesis Data Streams.
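
The KCL itself is Java, but the consumption loop it automates can be sketched with the low-level API in Python: obtain a shard iterator, pull a batch of records, and decode their data blobs. The stream and shard names are placeholders, and the AWS calls assume configured credentials:

```python
import json


def decode_records(records: list) -> list:
    """Deserialise the JSON data blobs carried by Kinesis records."""
    return [json.loads(r["Data"].decode("utf-8")) for r in records]


def read_shard(stream_name: str, shard_id: str, limit: int = 100) -> list:
    """Read up to `limit` of the oldest retained records from one shard."""
    import boto3  # AWS SDK for Python; needs credentials and an existing stream

    kinesis = boto3.client("kinesis")
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
    )["ShardIterator"]
    response = kinesis.get_records(ShardIterator=iterator, Limit=limit)
    return decode_records(response["Records"])
```

In production the KCL also handles checkpointing and load balancing across shards, which this sketch omits.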

Shards are an important part of Kinesis Data Streams: they provide both ingestion capacity, up to 1,000 records or 1 MB per second for each shard, and storage capability. Each data record has a partition key, which determines the shard it is written to, and a sequence number. A data record can hold a data blob of up to 1 MB. Data is stored from 24 hours (the default) up to one week; however, storing data is optional. Although the number of shards has to be chosen when the Kinesis Data Stream is set up, it can be changed later. This is called re-sharding.
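
Those per-shard limits make shard sizing simple arithmetic: a stream needs enough shards to cover whichever of the record rate or byte rate is the bottleneck. A small sketch, with a hedged helper for re-sharding an existing stream (the stream name would be yours; the AWS call assumes configured credentials):

```python
import math

# Per-shard write limits as described above: 1,000 records/s or 1 MB/s,
# whichever is reached first.
RECORDS_PER_SHARD = 1_000
BYTES_PER_SHARD = 1_000_000


def shards_needed(records_per_sec: float, bytes_per_sec: float) -> int:
    """Estimate the shard count required for a given write load."""
    by_records = math.ceil(records_per_sec / RECORDS_PER_SHARD)
    by_bytes = math.ceil(bytes_per_sec / BYTES_PER_SHARD)
    return max(by_records, by_bytes, 1)  # a stream always has at least one shard


def reshard(stream_name: str, target: int) -> None:
    """Re-shard an existing stream to the target shard count."""
    import boto3  # AWS SDK for Python; needs credentials

    boto3.client("kinesis").update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",
    )
```

For example, a load of 2,500 records/s at 0.5 MB/s is record-bound and needs 3 shards, while 100 records/s at 5 MB/s is byte-bound and needs 5.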

When to use

  • Transfer streaming data into AWS to be processed by data consumers
  • Data must be temporarily stored in case it needs to be reprocessed
  • Data needs to be processed before it is stored
  • Real time analytics by data consumers

Use with other Kinesis services

Kinesis Data Analytics can be used as a data consumer of data from Kinesis Data Streams.

Case study examples

Netflix case study – analysing Flow logs to identify issues rapidly.

Flow logs => EC2 => KDS => EC2

Zillow case study – fast real estate price estimates.

Data Producers => KDS => EMR

Data Producers => KDS => KDF => S3

Thomson Reuters case study – capturing user experience data.

Data Producers => KDS => Lambda => S3

Data Producers => KDS => EC2 => Elasticsearch

Video – Amazon Kinesis Data Streams Fundamentals

This video introduces Kinesis Data Streams in slightly more depth than is needed for the Machine Learning exam. However, since it is only 5:19 minutes long it is worth watching.

Kinesis Video Streams

Kinesis Video Streams processes video streams from connected devices. Data producers can be cameras or audio feeds. The streaming data can be sent directly to data consumers for processing, or to a data repository, e.g. S3, to be batch-processed by data consumers periodically.

When to use

  • Use Video streams when you need to collect video streaming data for processing and analysis in real-time.
  • Batch-process and store streaming video.
  • Feed streaming data into other AWS services

Kinesis Data Firehose

Kinesis Data Firehose receives massive volumes of streaming data and stores it in an AWS data repository. Data comes from a data producer, which sends streaming data to Kinesis Data Firehose. The data can then be delivered directly to a repository such as S3, Redshift, Elasticsearch or Splunk. Kinesis Data Firehose can optionally process the data via Lambda. Kinesis Data Firehose has no internal storage: it does not have shards, and so there is no data retention.
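
Writing to Firehose looks much like writing to a data stream, except the record goes to a named delivery stream and needs no partition key. The delivery-stream name below is hypothetical, and the call assumes configured AWS credentials; the newline delimiter is a common convention so that records land in S3 one per line:

```python
import json


def firehose_record(payload: dict) -> dict:
    """Wrap a payload as the Record dict Firehose expects, newline-delimited."""
    return {"Data": (json.dumps(payload) + "\n").encode("utf-8")}


def deliver(delivery_stream: str, payload: dict) -> dict:
    """Send one record to a Kinesis Data Firehose delivery stream."""
    import boto3  # AWS SDK for Python; needs credentials and an existing delivery stream

    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=delivery_stream,
        Record=firehose_record(payload),
    )
```

Firehose then buffers records by size or time before flushing a batch to the destination, which is why there is no per-record retention to manage.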

When to use

  • Send data directly to a data repository without processing
  • Final destination is S3
  • Data retention is not important

Use with other Kinesis services

Kinesis Data Firehose can be used as a producer for Kinesis Data Analytics.

Kinesis Data Analytics

Kinesis Data Analytics is an Apache Flink implementation. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Unbounded data has a start, but no defined end. Bounded data has a start and an end and so is suitable for batch processing. The real-time analysis and processing of streaming data is performed using standard SQL.
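
A typical analytics job aggregates an unbounded stream over fixed time windows. The idea can be illustrated in plain Python with a tumbling window over a (potentially endless) iterator of timestamped events; this is a sketch of the concept, not the Kinesis Data Analytics or Flink API:

```python
from collections import defaultdict
from typing import Iterable, Iterator


def tumbling_window_counts(
    events: Iterable[tuple[float, str]], window_seconds: float
) -> Iterator[dict]:
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp, key) pairs in time order. Because
    the stream may be unbounded, each window's aggregate is yielded as soon
    as the window closes, rather than waiting for the stream to end.
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current_window is not None and window != current_window:
            yield dict(counts)  # the window closed; emit its aggregate
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if counts:
        yield dict(counts)  # flush the final (possibly partial) window
```

Bounded data could simply be aggregated in one pass at the end; emitting per window is what makes the computation work on data with no defined end.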

When to use

  • When you need to take action in real time
  • When you need to enrich, organise and transform streaming data

Use with other Kinesis services

Can take streams from Kinesis Data Streams and Kinesis Data Firehose.

Case study examples

[Infographic: examples of case studies using Kinesis Data Analytics]

Apache Kafka

Apache Kafka was developed by LinkedIn and open sourced in 2011. It is a publish/subscribe messaging system with storage. In the publish/subscribe pattern the sender does not send data directly to a receiver. The sender, or producer, sends it to the messaging system, Kafka. Kafka stores the data, or messages, for an amount of time specified by a retention policy. The receivers, or consumers, subscribe to a topic and receive the data they want. Kafka acts as a broker between the data producers and data consumers, like a bulletin board. Kafka can handle real-time data streams reliably and with high throughput.
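
The pattern is easy to see in miniature. The toy broker below (pure Python, not the Kafka API) shows the essentials: producers publish to a named topic, the broker retains the messages, and consumers read from an offset they track themselves, exactly as Kafka consumers do:

```python
from collections import defaultdict


class MiniBroker:
    """A toy publish/subscribe broker illustrating the pattern described above."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> retained message log

    def publish(self, topic: str, message: str) -> None:
        """A producer appends a message to a topic; it is retained, not routed."""
        self._topics[topic].append(message)

    def consume(self, topic: str, offset: int = 0) -> list:
        """A consumer reads every retained message from `offset` onwards."""
        return self._topics[topic][offset:]
```

Because the broker retains messages rather than delivering them once, several consumers can read the same topic independently, each at its own pace. Real Kafka adds partitions, replication and time-based retention on top of this core idea.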

Kafka can be installed on EC2 instances or used as an Amazon service. Amazon MSK is Amazon Managed Streaming for Apache Kafka.

When to use

  • Streaming ingest
  • ETL (Extract Transform Load)
  • CDC (Change Data Capture)
  • Big data ingest

Video – Apache Kafka on AWS

This AWS video introduces Apache Kafka on AWS. It provides more than enough information for the AWS ML certification exam. Whilst the video lasts 49:36, you only need to watch the first 20 minutes.


This study guide has discussed Kinesis and Apache Kafka as AWS services that can consume, process and analyse data streams. Kinesis Data Streams (KDS) and Kafka are the only services that store data natively. Kinesis Data Firehose (KDF) and Kinesis Data Analytics (KDA) can change the data before it is stored. Video streaming data is handled by Kinesis Video Streams (KVS) and can be stored in S3. This is a summary and comparison of the Kinesis services:

  • KDS can store data internally
  • KDS and KDA cannot write directly to storage
  • KDF and KVS can write directly to storage
  • KDF can process data using Lambda
  • KDF and KDA can change the data

AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions for immediate access. (Visit Amazon books)


