Streaming data for Machine Learning

Streaming data processing is used when data is generated continuously and needs to be processed as it arrives. The AWS service for processing streaming data is Kinesis. Kinesis comprises four services, each with different capabilities, some of which can be used together. Alongside Kinesis there is another AWS service that can be used with streaming data: Amazon MSK, which is Amazon Managed Streaming for Apache Kafka.

The other form of data ingestion is batch processing, which is discussed in: Batch processing for Machine Learning

Identify and implement a data-ingestion solution is sub-domain 1.2 of the Data Engineering knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus

Questions

To confirm your understanding scroll to the bottom of the page for questions and answers.

Kinesis

Real-time data streams can be ingested by Amazon Kinesis, which can process and analyse the data as it arrives. Kinesis has four services, each with specific capabilities, that can be used separately or together.

Infographic showing the four Kinesis services and their key attributes

Kinesis Data Streams

Kinesis Data Streams can be configured to ingest large quantities of data and store it for a configurable retention period. Data is loaded into Kinesis Data Streams by data producers, which produce streaming data as JSON objects or data blobs. There are many kinds of data producer, for example:

  • IoT devices
  • user interaction data from a website or video games
  • manufacturing devices

Data producers can use the Kinesis Producer Library (KPL), a Java library, to write to Kinesis. An alternative is the Kinesis API, a lower-level way of interacting with Kinesis that lacks many of the enhanced features of the KPL. However, the Kinesis API can be used from any AWS SDK, not just Java. For a comparison see: Kinesis KPL vs API
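
As a rough sketch of the lower-level API route, the Python snippet below builds the arguments for a single PutRecord call and, in a separate function, sends them with boto3 (the AWS SDK for Python). The stream name `example-stream`, the sensor payload and both helper function names are assumptions made for illustration. The send step needs boto3 installed and AWS credentials configured, so it is defined but not called here.

```python
import json


def build_put_record_request(stream_name, payload, partition_key):
    """Build the keyword arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis treats the data as an opaque blob, so the JSON is sent as bytes.
        "Data": json.dumps(payload).encode("utf-8"),
        # The partition key decides which shard receives the record.
        "PartitionKey": partition_key,
    }


def send_record(request):
    """Send one record. Needs boto3 and AWS credentials (not called here)."""
    import boto3  # imported lazily so the builder works without the SDK

    client = boto3.client("kinesis")
    return client.put_record(**request)


# Hypothetical reading from an IoT device.
request = build_put_record_request(
    "example-stream",
    {"device_id": "sensor-42", "temperature": 21.5},
    "sensor-42",
)
print(request["PartitionKey"], len(request["Data"]), "bytes")
```

Using the device id as the partition key keeps all readings from one device on the same shard, and therefore in order.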

Kinesis Data Streams is a way to ingest this data into AWS. The data is stored in shards, from where it can be sent on to data consumers for analysis or storage. Kinesis Data Streams cannot send data directly to a data repository; it has to go through a data consumer first. Examples of data consumers are:

  • Lambda
  • EC2
  • Kinesis Data Analytics
  • EMR

Data consumers can use the Kinesis Client Library (KCL), a Java library, to consume data from Kinesis Data Streams.
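
Lambda, one of the consumers listed above, receives records from the stream in batches as its invocation event. The sketch below shows a minimal Python handler that decodes such a batch. The sample event is hand-made for illustration and contains only the fields the handler reads; the payload content is an assumption.

```python
import base64
import json


def handler(event, context):
    """Lambda entry point: decode each Kinesis record in the batch."""
    readings = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Lambda delivers the record data base64-encoded.
        payload = json.loads(base64.b64decode(kinesis["data"]))
        readings.append(
            {
                "partition_key": kinesis["partitionKey"],
                "sequence_number": kinesis["sequenceNumber"],
                "payload": payload,
            }
        )
    return readings


# A hand-made event with one record, shaped like the batches Lambda receives.
sample_event = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "sensor-42",
                "sequenceNumber": "1",
                "data": base64.b64encode(b'{"temperature": 21.5}').decode("ascii"),
            }
        }
    ]
}
print(handler(sample_event, None))
```

In a real consumer the decoded payloads would be analysed or written to a repository such as S3 rather than returned.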

Shards are an important part of Kinesis Data Streams. They provide both ingestion capacity, at 1000 records per second for each shard, and storage capability. Each data record carries a partition key, which determines the shard it is written to, and is assigned a sequence number. The data record can hold a data blob of up to 1MB. Data is retained for 24 hours by default, and the retention period can be extended to one week or longer. Whilst the number of shards has to be chosen when the Kinesis Data Stream is set up, the number can be changed later. This is called re-sharding.
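
The arithmetic behind choosing a shard count, and the way a partition key selects a shard, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AWS code: the 1 MB per second write limit comes from AWS's published per-shard limits rather than from this guide, the hash ranges are assumed to be split evenly between shards (which may not hold after re-sharding), and both function names are made up for the example.

```python
import hashlib
import math

RECORDS_PER_SECOND_PER_SHARD = 1000  # ingestion limit per shard
MB_PER_SECOND_PER_SHARD = 1          # assumed write limit per shard


def shards_needed(records_per_second, mb_per_second):
    """Estimate the shard count needed to absorb a given write load."""
    by_records = math.ceil(records_per_second / RECORDS_PER_SECOND_PER_SHARD)
    by_bytes = math.ceil(mb_per_second / MB_PER_SECOND_PER_SHARD)
    return max(1, by_records, by_bytes)


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    # The MD5 digest of the key is read as a 128-bit integer.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    range_size = 2**128 // shard_count
    # Clamp so the few values above the last full range land in the last shard.
    return min(hash_value // range_size, shard_count - 1)


# 4500 records per second at 2 MB per second: the record rate is the bottleneck.
print(shards_needed(4500, 2))  # 5
# The same key always maps to the same shard.
print(shard_for_key("sensor-42", 5))
```

Because the mapping depends only on the key, a key that dominates the traffic concentrates its load on one shard, so keys with many distinct values spread the load better.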

When to use

  • Transfer streaming data into AWS to be processed by data consumers
  • Data must be temporarily stored in case it needs to be reprocessed
  • Data needs to be processed before it is stored
  • Real time analytics by data consumers

Use with other Kinesis services

Kinesis Data Analytics can be used as a data consumer of data from Kinesis Data Streams.

Case study examples

Netflix case study – analysing Flow logs to identify issues rapidly.

Flow logs => EC2 => KDS => EC2

Zillow case study – fast real estate price estimates.

Data Producers => KDS => EMR

Data Producers => KDS => KDF => S3

Thomson Reuters case study – capturing user experience data.

Data Producers => KDS => Lambda => S3

Data Producers => KDS => EC2 => Elasticsearch

Video – Amazon Kinesis Data Streams Fundamentals

This video introduces Kinesis Data Streams in slightly more depth than is needed for the Machine Learning exam. However, since it is only 5 minutes 19 seconds long, it is worth watching.

Kinesis Video Streams

Kinesis Video Streams processes video streams from connected devices. Data producers can be cameras or audio feeds. The streaming data can be sent directly to data consumers for processing, or to a data repository such as S3 to be batch processed by data consumers periodically.

When to use

  • Use Kinesis Video Streams when you need to collect video streaming data for processing and analysis in real time
  • Batch-process and store streaming video.
  • Feed streaming data into other AWS services

Kinesis Data Firehose

Kinesis Data Firehose receives massive volumes of streaming data and stores it in a data repository. Data comes from a data producer, which sends streaming data to Kinesis Data Firehose. The data can then be sent directly to a repository such as S3, Redshift, Elasticsearch or Splunk. Kinesis Data Firehose can optionally perform processing on the data via Lambda. Kinesis Data Firehose does not have any internal storage: it has no shards and so no data retention.
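
The optional Lambda processing step can be sketched as a transformation function. Firehose passes the function a batch of base64-encoded records and expects each one back with the same `recordId` and a `result` of `Ok`, `Dropped` or `ProcessingFailed`. The `processed` field and the sample event below are assumptions made for illustration, and the sketch assumes each payload is a JSON object.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation: tag each JSON record, drop anything unparseable."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Records marked Dropped are not delivered to the destination.
            output.append(
                {"recordId": record_id, "result": "Dropped", "data": record["data"]}
            )
            continue
        payload["processed"] = True  # hypothetical enrichment field
        # A trailing newline keeps records separate in the delivered file.
        line = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(
            {
                "recordId": record_id,
                "result": "Ok",
                "data": base64.b64encode(line).decode("ascii"),
            }
        )
    return {"records": output}


def encode(raw):
    """Helper for the sample event: base64-encode raw bytes as text."""
    return base64.b64encode(raw).decode("ascii")


sample_event = {
    "records": [
        {"recordId": "1", "data": encode(b'{"clicks": 3}')},
        {"recordId": "2", "data": encode(b"not json")},
    ]
}
result = handler(sample_event, None)
print([r["result"] for r in result["records"]])  # ['Ok', 'Dropped']
```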

When to use

  • Send data directly to a data repository without processing
  • Final destination is S3
  • Data retention is not important

Use with other Kinesis services

Kinesis Data Firehose can be used as a data source for Kinesis Data Analytics.

Kinesis Data Analytics

Kinesis Data Analytics is an Apache Flink implementation. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Unbounded data has a start, but no defined end. Bounded data has a start and an end and so is suitable for batch processing. The real-time analysis and processing of streaming data is performed using standard SQL.
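
Real-time analysis of an unbounded stream usually works on windows of time, because there is never a complete data set to query. The plain Python sketch below imitates a tumbling-window count, roughly what a GROUP BY over a fixed time window does in streaming SQL. It is not Kinesis Data Analytics or Flink code, and the click events and function name are made up for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical click events: (seconds since start, page).
events = [(1, "home"), (4, "home"), (9, "cart"), (12, "home"), (14, "cart")]
for (window_start, page), count in tumbling_window_counts(events, 10).items():
    print(window_start, page, count)
```

Here the sketch processes a finite list; a streaming engine would instead emit the result for each window as that window closes.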

When to use

  • When you need to take action in real time
  • When you need to enrich, organise and transform streaming data

Use with other Kinesis services

Kinesis Data Analytics can take streams from:

  • Kinesis Data Streams
  • Kinesis Data Firehose

Case study examples

An infographic showing examples of case studies using Kinesis Data Analytics

Apache Kafka

Apache Kafka was developed by LinkedIn and open sourced in 2011. It is a publish / subscribe messaging system with storage. In the publish / subscribe pattern the sender does not send data directly to a receiver. Instead the sender, or producer, sends it to the messaging system, Kafka. Kafka stores the data, or messages, for an amount of time specified by a retention policy. The receivers, or consumers, subscribe to a topic and receive the data they want. Kafka acts as a broker between the data producers and data consumers, much like a bulletin board. Kafka can handle real-time data streams reliably and with high throughput.
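
The publish / subscribe pattern with retention can be illustrated with a toy in-memory broker. The Python sketch below is not the Kafka API: the `ToyBroker` class and its methods are made up for this example, and time is passed in explicitly so the retention behaviour is easy to follow.

```python
class ToyBroker:
    """In-memory stand-in for a publish / subscribe log with a retention period."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.logs = {}         # topic -> list of (offset, timestamp, message)
        self.next_offset = {}  # topic -> offset to give the next message

    def publish(self, topic, message, now):
        """Producer side: append a message to the topic's log."""
        offset = self.next_offset.get(topic, 0)
        self.logs.setdefault(topic, []).append((offset, now, message))
        self.next_offset[topic] = offset + 1

    def expire(self, now):
        """Apply the retention policy: discard messages older than the cutoff."""
        cutoff = now - self.retention_seconds
        for log in self.logs.values():
            log[:] = [entry for entry in log if entry[1] >= cutoff]

    def poll(self, topic, from_offset):
        """Consumer side: read retained messages at or after an offset."""
        log = self.logs.get(topic, [])
        return [(offset, msg) for offset, _, msg in log if offset >= from_offset]


broker = ToyBroker(retention_seconds=60)
broker.publish("orders", "order-1", now=0)
broker.publish("orders", "order-2", now=50)

# Two consumers read the same topic independently, each tracking its own offset.
print(broker.poll("orders", from_offset=0))  # [(0, 'order-1'), (1, 'order-2')]
print(broker.poll("orders", from_offset=1))  # [(1, 'order-2')]

broker.expire(now=100)  # order-1 is now older than the 60 second retention
print(broker.poll("orders", from_offset=0))  # [(1, 'order-2')]
```

The producer never addresses a consumer: it only names a topic, and each consumer decides where in the retained log to start reading.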

Kafka can be installed on EC2 instances or used as an Amazon service. Amazon MSK is Amazon Managed Streaming for Apache Kafka.

When to use

  • Streaming ingest
  • ETL (Extract Transform Load)
  • CDC (Change Data Capture)
  • Big data ingest

Video – Apache Kafka on AWS

This AWS video introduces Apache Kafka on AWS. It provides more than enough information for the AWS ML certification exam. Whilst the video lasts 49 minutes 36 seconds, you only need to watch the first 20 minutes.

Summary

This study guide has discussed Kinesis and Apache Kafka as services in AWS that can consume, process and analyse data streams. Kinesis Data Streams (KDS) and Kafka are the only services that store data natively. Kinesis Data Firehose (KDF) and Kinesis Data Analytics (KDA) can change the data before it is stored via a data consumer. Video streaming data is handled by Kinesis Video Streams (KVS) and stored in S3. This is a summary and comparison of the Kinesis services:

  • KDS can store data internally
  • KDS and KDA cannot write directly to storage
  • KDF and KVS can write directly to storage
  • KDF can process data using Lambda
  • KDF and KDA can change the data

Credits

AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paper and Kindle versions for immediate access. (Visit Amazon books)


10 questions and answers

Created by Michael Stainsbury

1.2 Streaming data for Machine Learning (Silver)

The 10 questions in this test cover the study guide for Streaming data for Machine Learning, which is part of sub-domain 1.2, Identify and implement a data-ingestion solution, of the Data Engineering knowledge domain.

Which member of the Kinesis family stores data in shards?

What other Kinesis service can act as a data consumer for Kinesis Data Streams?

What AWS services can be used to stream data into Kinesis Data Analytics?

What type of data does Kinesis Video Streams process?

