Streaming data for Machine Learning
Streaming data processing is used when data is continuously generated and needs to be processed as it arrives. The AWS service for streaming data processing is Kinesis. Kinesis comprises four services, each with different capabilities, some of which can be used together. In addition to Kinesis there is another AWS service that can be used with streaming data: Amazon MSK, Amazon Managed Streaming for Apache Kafka.
The other form of data ingestion is batch processing, which is discussed in: Batch processing for Machine Learning
Identify and implement a data-ingestion solution is sub-domain 1.2 of the Data Engineering knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus
Questions
Scroll to the bottom of the page for questions and answers.
Curated videos
Kinesis
Amazon Kinesis ingests real-time data streams and can process and analyse the data as it arrives. Kinesis has four services, each with specific capabilities, that can be used separately or together.

Kinesis Data Streams
Kinesis Data Streams can be configured to ingest large quantities of data and store it for up to one week. Data is loaded into Kinesis Data Streams by data producers, which emit streaming data as JSON objects or data blobs. There are many kinds of data producer, for example:
- IoT devices
- user interaction data from a website or video games
- manufacturing devices
Data producers can use the Kinesis Producer Library (KPL), a Java library, to write to Kinesis. An alternative is the Kinesis API, a lower-level way of interacting with Kinesis that lacks many of the KPL's enhanced features. However, the Kinesis API can be used from any AWS SDK, not just Java. For a comparison see Kinesis KPL vs API
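As a sketch of the low-level API route, the snippet below builds a record (a JSON data blob plus a partition key) and sends it with boto3's `put_record`. The stream name, event fields and helper names are illustrative, not from any real deployment; it assumes AWS credentials are already configured.

```python
import json

def build_record(event, partition_key):
    """Build the fields a Kinesis PutRecord call needs: a JSON data blob
    plus the partition key that routes the record to a shard."""
    return {
        "Data": json.dumps(event).encode("utf-8"),  # data blob, up to 1 MB
        "PartitionKey": partition_key,
    }

def put_event(stream_name, event, partition_key):
    """Send one record to a Kinesis Data Stream via the low-level API."""
    import boto3  # AWS SDK for Python; assumes credentials are configured
    client = boto3.client("kinesis")
    return client.put_record(StreamName=stream_name,
                             **build_record(event, partition_key))

# e.g. put_event("clickstream", {"user": "u42", "clicks": 3}, "u42")
```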
Kinesis Data Streams is a way to ingest this data into AWS. The data is stored in shards, from which it can be sent on to data consumers for analysis or storage. Kinesis Data Streams cannot send data directly to a data repository; it has to go through a data consumer first. Examples of data consumers are:
- Lambda
- EC2
- Kinesis Data Analytics
- EMR
Data consumers use the Kinesis Client Library (KCL), a Java library, to consume data from Kinesis Data Streams.
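A consumer can also use the low-level API directly (`get_shard_iterator` then `get_records`), although the KCL handles iteration, checkpointing and load balancing for you. A minimal sketch, with hypothetical stream and shard names, assuming AWS credentials are configured:

```python
def extract_blobs(get_records_response):
    """Pull the raw data blobs out of a Kinesis GetRecords response."""
    return [record["Data"] for record in get_records_response["Records"]]

def read_shard_once(stream_name, shard_id):
    """Read one batch of records from one shard with the low-level API."""
    import boto3  # assumes AWS credentials are configured
    client = boto3.client("kinesis")
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest stored record
    )["ShardIterator"]
    return extract_blobs(client.get_records(ShardIterator=iterator))

# e.g. read_shard_once("clickstream", "shardId-000000000000")
```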
Shards are an important part of Kinesis Data Streams: they provide both ingestion capacity, at 1,000 records per second per shard, and storage. Each data record is routed to a shard by its partition key and receives a sequence number within that shard. A data record can hold a data blob of up to 1 MB. Data is retained from 24 hours up to one week. Whilst the number of shards has to be chosen when the Kinesis Data Stream is set up, it can be changed later; this is called re-sharding.
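Since capacity scales with shard count, a first estimate of how many shards a stream needs can be sketched from the per-shard limits above. The 1,000 records/second figure is from the text; the 1 MB/second per-shard write limit is the corresponding documented data limit.

```python
import math

RECORDS_PER_SHARD = 1_000  # write limit per shard, records/second
MB_PER_SHARD = 1.0         # write limit per shard, MB/second

def shards_needed(records_per_sec, mb_per_sec):
    """Estimate the shard count for a stream from its expected write
    throughput. Re-sharding lets you adjust this after creation."""
    return max(
        math.ceil(records_per_sec / RECORDS_PER_SHARD),
        math.ceil(mb_per_sec / MB_PER_SHARD),
        1,  # a stream always has at least one shard
    )

# e.g. 3,000 small records/s needs 3 shards; 500 large records/s
# totalling 4.2 MB/s is bound by data volume and needs 5 shards.
```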
When to use
- Transfer streaming data into AWS to be processed by data consumers
- Data must be temporarily stored in case it needs to be reprocessed
- Data needs to be processed before it is stored
- Real time analytics by data consumers
Use with other Kinesis services
Kinesis Data Analytics can be used as a data consumer of data from Kinesis Data Streams.
Case study examples
Netflix case study – analysing Flow logs to identify issues rapidly.
Flow logs => EC2 => KDS => EC2
Zillow case study – fast real estate price estimates.
Data Producers => KDS => EMR
Data Producers => KDS => KDF => S3
Thomson Reuters case study – capturing user experience data.
Data Producers => KDS => Lambda => S3
Data Producers => KDS => EC2 => Elasticsearch
Video – Amazon Kinesis Data Streams Fundamentals
This video introduces Kinesis Data Streams in slightly more depth than is needed for the Machine Learning exam. However, since it is only 5:19 long, it is worth watching.
Kinesis Video Streams
Kinesis Video Streams processes video streams from connected devices. Data producers can be cameras or audio feeds. The streaming data can be sent directly to data consumers for processing, or to a data repository, e.g. S3, to be batch processed by data consumers periodically.
When to use
- Use Video streams when you need to collect video streaming data for processing and analysis in real-time.
- Batch-process and store streaming video.
- Feed streaming data into other AWS services
Kinesis Data Firehose
Kinesis Data Firehose receives massive amounts of streaming data and stores it in an AWS data repository. Data comes from data producers, which send streaming data to Kinesis Data Firehose. The data can then be sent directly to a repository such as S3, Redshift, Elasticsearch or Splunk. Kinesis Data Firehose can optionally process the data via Lambda. Kinesis Data Firehose does not have any internal storage; it has no shards and therefore no data retention.
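Delivering a record to Firehose looks much like the Data Streams producer, except there is no partition key and the destination is the delivery stream's configured repository. A sketch with a hypothetical delivery stream name, assuming AWS credentials are configured; the trailing newline is a common convention so JSON objects stay separated when Firehose concatenates records into S3 objects.

```python
import json

def build_firehose_record(event):
    """Wrap a JSON event as a Firehose record; the trailing newline keeps
    objects separated when Firehose concatenates them in S3."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def deliver(delivery_stream, event):
    """Send one record to a Firehose delivery stream (PutRecord)."""
    import boto3  # assumes AWS credentials are configured
    client = boto3.client("firehose")
    return client.put_record(
        DeliveryStreamName=delivery_stream,
        Record=build_firehose_record(event),
    )

# e.g. deliver("price-estimates", {"listing": "z-99", "price": 410000})
```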
When to use
- Send data directly to a data repository without processing
- Final destination is S3
- Data retention is not important
Use with other Kinesis services
Kinesis Data Firehose can be used as an input source for Kinesis Data Analytics.
Kinesis Data Analytics


Kinesis Data Analytics performs real-time analysis and processing of streaming data using standard SQL, and also supports applications built on Apache Flink. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Unbounded data has a start but no defined end; bounded data has a start and an end, and so is suitable for batch processing.
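A typical streaming-SQL query aggregates events over fixed time windows. The toy below is plain Python, not Kinesis Data Analytics' API; it only illustrates the tumbling-window aggregation such a query would express, using made-up (timestamp, key) events.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping time
    windows and count events per key in each window -- the kind of
    aggregation a streaming SQL query expresses over a stream."""
    counts = Counter()
    for timestamp, key in events:
        # Each window covers [window_start, window_start + window_seconds)
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return counts

# e.g. per-page click counts in 10-second windows:
# tumbling_window_counts([(0, "home"), (5, "home"), (12, "cart")], 10)
```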
When to use
- When you need to take action in real time
- When you need to enrich, organise and transform streaming data
Use with other Kinesis services
Can take streams from Kinesis Data Streams and Kinesis Data Firehose.
Case study examples

- Autodesk case study: https://aws.amazon.com/solutions/case-studies/autodesk-log-analytics/
- Fox case study: https://aws.amazon.com/blogs/media/fox-uses-aws-to-score-with-digital-audiences-for-super-bowl-liv/
- Palringo case study: https://aws.amazon.com/solutions/case-studies/palringo/
Apache Kafka
Apache Kafka was developed by LinkedIn and open sourced in 2011. It is a publish/subscribe messaging system with storage. In the publish/subscribe pattern, the sender does not send data directly to a receiver. The sender, or producer, sends it to the messaging system, Kafka. Kafka stores the data, or messages, for an amount of time specified by a retention policy. The receivers, or consumers, subscribe to a topic and receive the data they want. Kafka acts as a broker between data producers and data consumers, like a bulletin board. Kafka can handle real-time data streams reliably and with high throughput.
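The bulletin-board idea can be sketched as a toy in-memory broker. This is not Kafka's API (for that you would use a client library such as kafka-python against a running broker); it only models the pattern described above: producers append to a named topic, the broker retains the messages, and each consumer reads forward from its own offset.

```python
class ToyBroker:
    """Minimal bulletin-board broker: producers publish to named topics,
    messages are retained, and consumers poll from their own offset."""

    def __init__(self):
        self._topics = {}   # topic name -> list of retained messages
        self._offsets = {}  # (consumer, topic) -> next position to read

    def publish(self, topic, message):
        """Producer side: append a message to the topic's retained log."""
        self._topics.setdefault(topic, []).append(message)

    def poll(self, consumer, topic):
        """Consumer side: return the messages this consumer has not yet
        seen, then advance its offset to the end of the log."""
        log = self._topics.get(topic, [])
        start = self._offsets.get((consumer, topic), 0)
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]
```

Because the log is retained rather than delivered once, a second consumer subscribing later still receives every message, exactly as the retention policy allows in Kafka.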
Kafka can be installed on EC2 instances or used as an Amazon service. Amazon MSK is Amazon Managed Streaming for Apache Kafka.
When to use
- Streaming ingest
- ETL (Extract Transform Load)
- CDC (Change Data Capture)
- Big data ingest
Video – Apache Kafka on AWS
This AWS video introduces Apache Kafka on AWS. It provides more than enough information for the AWS ML certification exam. Whilst the video lasts 49:36, you only need to watch the first 20 minutes.
Summary
This study guide has discussed Kinesis and Apache Kafka as AWS services that can consume, process and analyse data streams. Kinesis Data Streams (KDS) and Kafka are the only services that store data natively. Kinesis Data Firehose (KDF) and Kinesis Data Analytics (KDA) can change the data before it is stored. Video streaming data is handled by Kinesis Video Streams (KVS) and can be stored in S3. This is a summary and comparison of the Kinesis services:
- KDS can store data internally
- KDS and KDA cannot write directly to storage
- KDF and KVS can write directly to storage
- KDF can process data using Lambda
- KDF and KDA can change the data
Credits
- Waterfall photo by Jeffrey Workman on Unsplash
- AWS icons: Downloaded from https://aws.amazon.com/architecture/icons/
- Kafka logo downloaded from: Wiki commons, https://commons.wikimedia.org/wiki/File:Apache_kafka-icon.svg. Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries
Contains affiliate links. If you go to Whizlab’s website and make a purchase I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.
Whizlabs AWS Certified Machine Learning Specialty
Practice Exams with 271 questions, Video Lectures and Hands-on Labs from Whizlabs
Whizlab’s AWS Certified Machine Learning Specialty practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined in the official documentation. These practice tests help candidates gain confidence in their exam preparation and self-evaluate against the exam content.
Practice test content
- Free Practice test – 15 questions
- Practice test 1 – 65 questions
- Practice test 2 – 65 questions
- Practice test 3 – 65 questions
10 questions and answers
Whizlab’s AWS Certified Machine Learning Specialty course
- In Whizlabs AWS Machine Learning certification course, you will learn and master how to build, train, tune, and deploy Machine Learning (ML) models on the AWS platform.
- Whizlab’s Certified AWS Machine Learning Specialty practice tests offer you a total of 200+ unique questions to get a complete idea about the real AWS Machine Learning exam.
- Also, you get access to hands-on labs in this course. There are about 10 lab sessions that are designed to take your practical skills on AWS Machine Learning to the next level.

Course content
The course has 3 resources which can be purchased separately, or together:
- 9 Practice tests with 271 questions
- Video course with 65 videos
- 9 hands on labs