
Machine Learning data repositories compared

Data repositories, or storage, are where data files are kept before, during and after processing. To realise the full potential of Machine Learning, appropriate storage solutions need to be available. The characteristics of storage most relevant to Machine Learning are:

  1. Cost – particularly for large quantities of data
  2. Availability – how long does it take to make the data ready for processing
  3. Usability – can the preferred Machine Learning and pre-processing tools access the storage and how fast is it

A comparison table is at the end of this Study Guide.

"Create data repositories for machine learning" is subdomain 1.1 of the Data Engineering knowledge domain. For more information about the exam structure, see the AWS Machine Learning exam syllabus.

Questions

Scroll to the bottom of the page for questions and answers.

There are two types of data repositories for Machine Learning: those that interface directly with SageMaker, and those that hold data that must be transferred to another datastore, usually S3, before it can be used. Originally SageMaker could only receive data from S3 for both training and production. Now there are more options to improve the performance of training.

Repositories for SageMaker

In 2019 AWS announced that SageMaker could accept data directly from EFS and Amazon FSx for Lustre as well as S3. The driver for providing access to new data repositories is to reduce the time spent waiting for S3 objects to finish loading: block- and file-based data can be processed as it is received.

Data repositories infographic summarizing S3, EBS, EFS and FSx for Lustre

S3

S3 is an object-based data repository: files are stored as single objects identified by a key. It is massively scalable because AWS maintains vast farms of storage servers and provisions more when usage thresholds are exceeded. S3 is highly available and durable, and the costs are very low.
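As a hedged illustration of what object storage looks like in practice, the sketch below uses boto3 to write and then read back a dataset object. The bucket and key names are invented placeholders, not part of any particular project.

```python
# A minimal sketch of storing and retrieving a training file in S3 with boto3.
# The bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a local CSV as an object; the key is the object's unique identifier.
s3.upload_file("train.csv", "my-ml-datasets", "churn/train.csv")

# Download it again, for example onto a processing instance.
s3.download_file("my-ml-datasets", "churn/train.csv", "/tmp/train.csv")
```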

S3 life cycle configuration

How data is used and its storage requirements can change over time. New data may need to be used immediately from low-latency storage. As data ages it may become less important and can be moved to cheaper, higher-latency storage and eventually archived or deleted. The process that moves data between different tiers of storage is life cycle management.

S3 life cycle management comprises two types of actions:

  1. Transition
  2. Expiration

Transition is the process of moving datasets through storage classes with different characteristics. Typically this is used to move datasets from highly available storage, ready for immediate processing, to cheaper, reduced-availability storage as the data gets older and is less likely to be needed for immediate processing. For example: datasets are kept in S3 for 90 days and are then moved to S3 Glacier.

Expiration is the process by which data is automatically deleted after a certain period of time. For example: all datasets older than 450 days must be deleted. This is important for regulatory requirements that may only allow some data to be retained for a limited period.
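The two example rules above (a 90-day transition to S3 Glacier and a 450-day expiration) can be expressed together in a single life cycle configuration. The sketch below shows one possible way to apply them with boto3; the bucket name and prefix are placeholders.

```python
# A sketch of the life cycle rules described above (90-day transition to
# Glacier, 450-day expiration) applied with boto3; the bucket is a placeholder.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-datasets",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "datasets/"},
                "Status": "Enabled",
                # Transition: move objects to Glacier after 90 days
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # Expiration: delete objects after 450 days
                "Expiration": {"Days": 450},
            }
        ]
    },
)
```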

Infographic to show S3 transition and expiration

S3 Data storage options

AWS S3 has multiple storage options to satisfy the needs of Machine Learning.

General purpose

This is S3 Standard, an object store with high durability and availability built in. Datasets are instantly accessible. This storage option is for regularly accessed data that must be immediately available.

Unknown or changing access

S3 Intelligent-Tiering is for data that is accessed in an unpredictable way. Data in the repository is automatically moved between instantly accessible storage and longer-term storage depending on when it was last accessed.

Infrequent access

S3 Standard-IA (Infrequent Access) is for data that is not accessed frequently. S3 One Zone-IA is for low-value or easily recreated data.

Archive

S3 Glacier and S3 Glacier Deep Archive provide long-term, low-cost archiving of data.
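A storage class can also be chosen explicitly when an object is written. The following sketch shows the idea with boto3; the bucket, key and file name are placeholders.

```python
# A sketch showing how a storage class can be chosen when an object is written;
# bucket, key and file name are placeholders.
import boto3

s3 = boto3.client("s3")

with open("features.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-ml-datasets",
        Key="archive/features.parquet",
        Body=f,
        # Other valid values include STANDARD, STANDARD_IA, ONEZONE_IA,
        # GLACIER and DEEP_ARCHIVE.
        StorageClass="INTELLIGENT_TIERING",
    )
```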

AWS Lake Formation

AWS Lake Formation is a service from Amazon that rapidly sets up a data lake with S3 as the data repository. A data lake is a data repository that stores structured and unstructured data at any scale. Data lakes allow data to be centralised and made available for analysis before the purpose of the analysis is defined. The purpose of Lake Formation is to rapidly deploy a data lake as a data repository that includes built-in:

  • Security
  • ETL (Extract Transform Load)
  • Formatting, for example Parquet, ORC
  • ML to improve data quality

Lake Formation is built on top of AWS Glue, leveraging all the features of that service. During the set up you are taken through a series of options to guide Lake Formation to:

  1. Find the input data sources
  2. Set up the S3 data lake
  3. Move the data to the S3 data lake
  4. Crawl the data to determine its structure and build a data catalogue
  5. Perform ETL
  6. Set up security to protect the data.

Once these tasks are complete the data is ready for Machine Learning processing.
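Lake Formation drives these steps from its console, but because it is built on AWS Glue, the crawl step (step 4) can also be pictured directly against the Glue API. The sketch below is illustrative only; the crawler name, IAM role and S3 path are placeholders.

```python
# A rough sketch of the crawl step using the Glue API that Lake Formation
# builds on; the crawler name, role ARN and S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="datalake-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="ml_data_catalogue",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
)

# Run the crawler to infer the schema and populate the data catalogue.
glue.start_crawler(Name="datalake-crawler")
```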

Amazon FSx for Lustre

FSx for Lustre is a high-performance combination of S3 and SSD storage. Data is presented as files to Machine Learning models, which enables processing to start immediately without waiting for S3 objects to be fully loaded. Lustre (lustre.org) is an open-source parallel file system supporting High Performance Computing, originally from a research project at Carnegie Mellon University. FSx for Lustre is Amazon's way of supercharging its storage, including S3. The features of FSx for Lustre are:

  • high performance storage system
  • low latency
  • high throughput
  • high IOPS
  • multiple underlying storage types

FSx for Lustre can be linked to S3, giving concurrent access to both S3 and the Lustre high-performance file system; S3 objects appear as files.

The Machine Learning use case for FSx for Lustre is serving massive training datasets to SageMaker. The shared file store allows multiple compute instances to work on the data concurrently. FSx for Lustre integrates with SageMaker, and data is lazily loaded, eliminating the time cost of an initial download from S3. This also reduces the cost of accessing common objects for iterative jobs on the same datasets.
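A rough sketch of pointing a training job at an FSx for Lustre file system with the SageMaker Python SDK is shown below. The file system id, directory path and estimator are placeholders; the same FileSystemInput class also accepts EFS.

```python
# A sketch of feeding a SageMaker training job from FSx for Lustre via the
# SageMaker Python SDK; the file system id and path are placeholders.
from sagemaker.inputs import FileSystemInput

train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder FSx file system id
    file_system_type="FSxLustre",            # "EFS" is also accepted here
    directory_path="/fsx/training-data",     # placeholder mount path
    file_system_access_mode="ro",            # read-only access for training
)

# estimator is a previously configured sagemaker.estimator.Estimator
# estimator.fit({"train": train_input})
```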

Video – Amazon FSx for Lustre, Persistent Storage Overview

This AWS video by Darryl Osborne introduces Amazon FSx for Lustre. The video is 8:49 long; below are the timestamps for the subjects covered:

  • 0:00 – Amazon FSx for Lustre
  • 1:11 – Amazon FSx for Lustre development options
  • 1:37 – Amazon FSx for Lustre demo environment
  • 4:44 – Amazon FSx for Lustre write test
  • 5:40 – Amazon FSx for Lustre read test
  • 6:40 – Amazon FSx for Lustre in-memory cache test
  • 8:06 – GitHub scripts
  • 8:49 – End

EBS volumes

Elastic Block Store (EBS) volumes are the virtual version of a PC's hard drive. Data is stored as files and fast access can be specified. The data can be backed up via snapshots for durability, and it is also possible to set up RAID configurations. Provisioned IOPS (input/output operations per second) can be expensive, but if you need the performance it is there. An EBS volume is a single virtual drive that must be attached to a single EC2 instance. The instances created for SageMaker Notebooks are EC2 instances with EBS volumes.
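As an illustration, the EBS volume behind a notebook instance is sized when the instance is created. The sketch below uses the boto3 SageMaker client; the instance name and IAM role are placeholders.

```python
# A sketch showing that the EBS volume backing a SageMaker notebook instance
# is sized at creation time; the name and role ARN are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_notebook_instance(
    NotebookInstanceName="ml-workbench",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    VolumeSizeInGB=50,   # size of the attached EBS volume
)
```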

EFS

EFS is the networked counterpart of EBS: a file system that can be mounted by multiple compute instances at the same time, so data can be shared between them. There is both a Standard EFS version and an infrequently accessed version, EFS IA, which allows you to save costs on data files that are used less often.

Secondary data repositories

Infographic identifying common data repositories used in AWS Machine Learning

The data in these repositories has to be loaded into a data repository that SageMaker can consume, typically S3.

RDS

Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database. There are many database engines to choose from, both open source (MySQL, PostgreSQL) and vendor owned (Oracle, Microsoft SQL Server). AWS takes care of most of the administration and maintenance, leaving users free to concentrate on getting benefit from the database.

Use cases:

  • Data that is relational and structured
  • Data warehouse
  • Online Transaction Processing (OLTP)
  • Running relational joins and complex updates.
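Before SageMaker can use relational data it typically has to be extracted and staged in S3. A minimal sketch of that step, assuming a PostgreSQL based RDS instance and the pandas, SQLAlchemy and s3fs packages, is shown below; the connection string, table and bucket are placeholders.

```python
# A rough sketch of extracting a table from RDS (PostgreSQL) with pandas and
# staging it in S3 as Parquet. Connection details are placeholders; writing to
# an s3:// path assumes the s3fs package is installed.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@my-rds-host:5432/sales")

df = pd.read_sql("SELECT * FROM transactions WHERE created_at >= '2023-01-01'", engine)

# Stage the extract in S3 so SageMaker can consume it.
df.to_parquet("s3://my-ml-datasets/rds-extracts/transactions.parquet", index=False)
```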

DynamoDB

DynamoDB is a NoSQL database where data is stored as key-value pairs. It treats all data within it as being composed of attributes and values. The data can be both structured and unstructured. It is fast and can handle massive quantities of transactions. DynamoDB use cases are:

  • Non-relational data
  • Structured and less structured data
  • Storing JSON documents
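To make the attribute/value structure concrete, the sketch below writes and reads one item with boto3; the table name, keys and attributes are invented placeholders.

```python
# A sketch of the attribute/value structure DynamoDB stores; the table name
# and attribute names are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("customer-events")

# Each item is a collection of attribute/value pairs keyed by the primary key.
table.put_item(Item={
    "customer_id": "c-1001",           # partition key
    "event_time": "2023-06-01T10:15",  # sort key
    "event": "page_view",
    "metadata": {"device": "mobile", "duration_ms": 3400},  # nested, JSON-like data
})

item = table.get_item(Key={"customer_id": "c-1001", "event_time": "2023-06-01T10:15"})
print(item["Item"])
```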

RedShift

Amazon Redshift is a fast, massively scalable data warehouse system. It stores structured data that can be accessed and manipulated with standard SQL. The underlying columnar storage allows complex analytical queries against massive datasets. Redshift use cases are:

  • Data warehouse
  • Structured relational data
  • Complex analytical queries.
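As with RDS, Redshift data is usually staged in S3 before Machine Learning processing. One common route is an UNLOAD statement, sketched below via the Redshift Data API; the cluster, database, IAM role and bucket are placeholders.

```python
# A sketch of unloading a Redshift query result to S3 as Parquet via the
# Redshift Data API; cluster, database, role and bucket are placeholders.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="warehouse",
    DbUser="etl_user",
    Sql="""
        UNLOAD ('SELECT * FROM sales_facts WHERE sale_date >= ''2023-01-01''')
        TO 's3://my-ml-datasets/redshift-extracts/sales_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET;
    """,
)
```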

Redshift Spectrum

Redshift Spectrum is a feature of Redshift that allows SQL access to data stored in S3. SQL can be used to access data in both Redshift and Redshift Spectrum within the same query. To access the data, it must be represented in a catalogue, for example a Glue database or an Athena catalogue. Redshift Spectrum use cases are:

  • Data lake
  • Semi structured data

Timestream

Amazon Timestream is a serverless database for storing time series data, for example from IoT devices or log data. Data is stored and queried by time intervals. Access to data is very fast. Timestream use cases are:

  • Use Timestream to identify trends, patterns and anomalies in time series data
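A small sketch of a Timestream query over a recent time window is shown below; the database, table and measure names are placeholders.

```python
# A sketch of querying Timestream for the last hour of a measure;
# database, table and measure names are placeholders.
import boto3

tsq = boto3.client("timestream-query")

result = tsq.query(
    QueryString="""
        SELECT device_id, AVG(measure_value::double) AS avg_temp
        FROM "iot_db"."sensor_readings"
        WHERE time > ago(1h) AND measure_name = 'temperature'
        GROUP BY device_id
    """
)

for row in result["Rows"]:
    print(row)
```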

DocumentDB

DocumentDB is a repository optimised for storing and querying JSON documents. It is compatible with MongoDB and is marketed by AWS as a way to migrate existing MongoDB workloads onto AWS managed infrastructure. DocumentDB use cases are:

  • Use to migrate MongoDB to AWS
  • Store JSON documents
  • Non-relational data
  • Less structured data
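Because DocumentDB is MongoDB compatible, it is normally accessed with a standard MongoDB driver such as pymongo. The sketch below stores and reads back one JSON document; the cluster endpoint, credentials and certificate bundle are placeholders.

```python
# A sketch of connecting to a DocumentDB cluster with pymongo and storing a
# JSON document. Endpoint, credentials and CA bundle are placeholders;
# DocumentDB normally requires TLS and does not support retryable writes.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mluser:password@my-docdb-cluster.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017",
    tls=True,
    tlsCAFile="global-bundle.pem",   # CA bundle downloaded from AWS
    retryWrites=False,
)

collection = client["ml_store"]["annotations"]

collection.insert_one({"image_id": "img-001", "labels": ["cat", "outdoor"], "source": "labelling-job-42"})
print(collection.find_one({"image_id": "img-001"}))
```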

Data repositories compared

The data repositories have been compared in the following table by:

  1. Cost – particularly for large quantities of data
  2. Availability – how long does it take to make the data ready for processing
  3. Usability – can the preferred Machine Learning and pre-processing tools access the storage

This is a subjective comparison that may change dramatically depending on the application and size of the data.

Repository | Uses | Cost | Availability | Usability
--- | --- | --- | --- | ---
S3 Standard and Lake Formation | General purpose | Low | Fast | Excellent
S3 Intelligent-Tiering | Unknown or changing access | Lower | Fast | Excellent
S3 Standard-IA and One Zone-IA | Data infrequently accessed | Lower | Fast | Good
S3 Glacier | Long term archive | Very low | Very slow | Needs retrieval to S3
FSx for Lustre with S3 | High speed processing | High | Very fast | Excellent
EBS | High speed processing | Medium | Very fast | Excellent
EFS | High speed processing | Medium | Very fast | Excellent
RDS | Relational database | Medium | Fast | Needs ETL to S3
DynamoDB | NoSQL JSON database | Low | Fast | Needs ETL to S3
Redshift | Data warehouse | Low | Fast | Needs ETL to S3
Timestream | Time series optimised database | ??? | Fast | Needs ETL to S3
DocumentDB | JSON document database | Low | Fast | Needs ETL to S3

Summary

S3 is king for Machine Learning because the data is in a form that SageMaker can ingest. However, there are many repositories that may hold the raw data prior to extraction and processing to make it ready for Machine Learning. There are also some newer data repositories that can be accessed during training to improve performance. This article has briefly described these repositories, their attributes and their use cases relevant to Machine Learning.

Credits


AWS Certified Machine Learning Study Guide: Specialty (MLS-C01) Exam

This study guide provides the domain-by-domain specific knowledge you need to build, train, tune, and deploy machine learning models with the AWS Cloud. The online resources that accompany this Study Guide include practice exams and assessments, electronic flashcards, and supplementary online resources. It is available in both paperback and Kindle versions for immediate access. (Visit Amazon Books)


Questions and answers


1.1 Machine Learning data repositories (Silver)

This test has 10 questions that test the study guide for sub-domain 1.1, Create data repositories for machine learning, of the Data Engineering knowledge domain.

  1. What does Redshift Spectrum enable?
  2. What data store is suitable for unstructured and semi-structured data?
  4. What type of databases does RDS provide?
  5. What type of data does Amazon Timestream store?
  6. What data store is suitable for structured data?
  7. What is the maximum amount of data an S3 bucket can hold?
  8. Which of these databases can RDS provision?
  9. Which data store can you migrate MongoDB to?
  10. What do DynamoDB attributes look like?


