
Machine Learning data repositories compared

Data repositories, or storage, are where data files are stored before, during, and after processing. To realise the full potential of Machine Learning, appropriate storage solutions need to be available. The characteristics of storage most relevant to Machine Learning are:

  1. Cost – particularly for large quantities of data
  2. Availability – how long does it take to make the data ready for processing
  3. Usability – can the preferred Machine Learning and pre-processing tools access the storage and how fast is it

A comparison table is at the end of this Study Guide.

Create data repositories for machine learning is subdomain 1.1 of the Data Engineering knowledge domain. For more information about the exam structure see the AWS Machine Learning exam syllabus.

Task Statement 1.1: Create data repositories for ML.
– Identify data sources (for example, content and location, primary sources such as user data).
– Determine storage mediums (for example, databases, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon Elastic Block Store [Amazon EBS]).

AWS Machine Learning – Speciality exam guide


Scroll to the bottom of the page for questions and answers.


There are two types of data repositories for Machine Learning: those that interface directly with SageMaker, and those that store data that must be transferred to another datastore, usually S3, before it can be used. Originally SageMaker could only receive data from S3 for both training and production. Now there are more options to improve training performance.

Repositories for SageMaker

In 2019 AWS announced that SageMaker could accept data directly from EFS and Amazon FSx for Lustre, as well as from S3. The driver for providing access to these new data repositories is to reduce the time spent waiting for S3 objects to finish loading: block- and file-based data can be processed as it is received.

Data repositories infographic summarizing S3, EBS, EFS and FSx for Lustre


S3 is an object-based data repository: files are stored as single objects identified by a key. It is massively scalable because AWS maintains vast farms of storage servers and provisions more when usage thresholds are exceeded. S3 is highly available and durable, and its costs are very low.

S3 life cycle configuration

How data is used, and its storage requirements, can change over time. New data may need to be used immediately from low-latency storage. As data ages it may become less important and can be moved to cheaper, higher-latency storage, and eventually archived or deleted. The process that moves data between different tiers of storage is life cycle management.

S3 life cycle management comprises two types of actions:

  1. Transition
  2. Expiration

Transition is the process of moving datasets through storage classes with different characteristics. Typically this is used to move datasets from highly available storage, ready for immediate processing, to cheaper, reduced-availability storage as the data gets older and is less likely to be needed for immediate processing. For example: datasets are stored in S3 for 90 days and are then moved to S3 Glacier.

Expiration is the process by which data is automatically deleted after a certain period of time. For example: all datasets older than 450 days must be deleted. This is important for regulatory requirements that may require some data to be retained only for limited periods.
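
The two examples above (transition at 90 days, expiration at 450 days) can be sketched as a single S3 lifecycle rule. This is a minimal sketch using boto3; the bucket name and prefix are placeholders, and the API call is left commented out so the rule can be reviewed before applying.

```python
# Sketch of the 90-day transition and 450-day expiration examples above,
# expressed as an S3 lifecycle rule. Bucket name and prefix are placeholders.
lifecycle_rule = {
    "ID": "ml-dataset-lifecycle",
    "Filter": {"Prefix": "datasets/"},
    "Status": "Enabled",
    # Transition: move objects to S3 Glacier after 90 days
    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
    # Expiration: delete objects after 450 days
    "Expiration": {"Days": 450},
}

# To apply it with boto3 (not run here):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-ml-datasets",
#     LifecycleConfiguration={"Rules": [lifecycle_rule]},
# )
```

Both action types live in the same rule, so a dataset prefix can be tiered down and eventually cleaned up with one configuration.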

Infographic to show S3 transition and expiration

S3 Data storage options

AWS S3 has multiple storage options to satisfy the needs of Machine Learning.

General purpose

This is Standard S3. S3 is an object store that has high durability and availability built in. The datasets are instantly accessible. This storage option is for regularly accessed data that must be immediately available.

Unknown or changing access

S3 Intelligent-Tiering is for data that is accessed in an unpredictable way. Data in the repository is automatically moved between an instant-access tier and lower-cost, longer-term tiers depending on when it was last accessed.

Infrequent access

S3 Standard-IA (Infrequent Access) is for data that is not accessed frequently. S3 One Zone-IA is for low-value, or easily recreated, data.


S3 Glacier and S3 Glacier Deep Archive are used to provide long-term, low-cost archiving of data.

AWS Lake Formation

AWS Lake Formation is a service from Amazon that rapidly sets up a data lake with S3 as the data repository. A data lake is a data repository that stores structured and unstructured data at any scale. Data lakes allow data to be centralised and made available for analysis before the purpose of the analysis is defined. The purpose of Lake Formation is to rapidly deploy a data lake as a data repository that includes built in:

  • Security
  • ETL (Extract Transform Load)
  • Formatting, for example Parquet, ORC
  • ML to improve data quality

Lake Formation is built on top of AWS Glue, leveraging all the features of that service. During setup you are taken through a series of options that guide Lake Formation to:

  1. Find the input data sources
  2. Set up the S3 data lake
  3. Move the data to the S3 data lake
  4. Crawl the data to determine its structure and build a data catalogue
  5. Perform ETL
  6. Set up security to protect the data.

Once these tasks are complete the data is ready for Machine Learning processing.
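
Step 4 above, crawling the data to build a catalogue, is handled by an AWS Glue crawler. The sketch below shows hypothetical parameters for one; the IAM role, database name, and S3 path are placeholders, and the boto3 calls are left commented out.

```python
# Hypothetical parameters for the "crawl the data" step: an AWS Glue crawler
# that scans an S3 path and writes table definitions to a Data Catalog database.
crawler_params = {
    "Name": "ml-datalake-crawler",
    "Role": "GlueServiceRole",            # placeholder IAM role
    "DatabaseName": "ml_data_catalogue",  # catalogue database to populate
    "Targets": {"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
    # Re-crawl daily so newly arrived datasets are catalogued automatically
    "Schedule": "cron(0 2 * * ? *)",
}

# To create and start the crawler with boto3 (not run here):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```

Once the crawler has populated the catalogue, the tables it creates are what Lake Formation's ETL and security features operate on.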

Amazon FSx for Lustre

FSx for Lustre is a high-performance combination of S3 and SSD storage. Data is presented as files to the Machine Learning models, which enables processing to start immediately without waiting for S3 objects to be fully loaded. Lustre is an open source parallel file system supporting High Performance Computing, originally from a research project at Carnegie Mellon University. FSx for Lustre is Amazon’s way of supercharging its storage, including S3. The features of FSx for Lustre are:

  • high performance storage system
  • low latency
  • high throughput
  • high IOPS
  • multiple underlying storage types

FSx for Lustre can be linked to S3, giving concurrent access to both S3 and the Lustre high-performance file system. S3 objects appear as files.

The Machine Learning use case for FSx for Lustre is serving massive training datasets to SageMaker. The concurrent file store allows multiple compute instances to share and work on the data at the same time. FSx for Lustre integrates with SageMaker: the data is lazy loaded, eliminating the time cost of the initial download from S3. This also reduces the cost of accessing common objects in iterative jobs on the same datasets.
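
To show what the SageMaker integration looks like, below is a sketch of the input channel a CreateTrainingJob request uses when the data source is FSx for Lustre rather than S3. The file system ID and directory path are placeholders.

```python
# Sketch of a SageMaker CreateTrainingJob input channel backed by FSx for
# Lustre instead of S3. The file system ID and directory path are placeholders.
training_channel = {
    "ChannelName": "training",
    "DataSource": {
        "FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",  # placeholder FSx ID
            "FileSystemType": "FSxLustre",
            "DirectoryPath": "/fsx/training-data",
            "FileSystemAccessMode": "ro",  # read-only is enough for training
        }
    },
}
# This dict would be passed in the InputDataConfig list of a
# create_training_job call; the training job then reads files directly
# from the Lustre file system with no up-front S3 download.
```

Swapping `"FSxLustre"` for `"EFS"` (and using an EFS file system ID) gives the equivalent EFS-backed channel.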

Video – Amazon FSx for Lustre, Persistent Storage Overview

This AWS video by Darryl Osborne introduces Amazon FSx for Lustre. The video is 8:49 long; below are the timestamps for the subjects covered:

  • 0:00 – Amazon FSx for Lustre
  • 1:11 – Amazon FSx for Lustre development options
  • 1:37 – Amazon FSx for Lustre demo environment
  • 4:44 – Amazon FSx for Lustre write test
  • 5:40 – Amazon FSx for Lustre read test
  • 6:40 – Amazon FSx for Lustre in-memory cache test
  • 8:06 – GitHub scripts
  • 8:49 – End

EBS volumes

Elastic Block Store, EBS, volumes are the virtual version of a PC’s hard drive. Data is stored as files and fast access can be specified. The data can be backed up via snapshots for durability, and it is also possible to set up RAID configurations. Provisioned IOPS capacity can be expensive, but if you need it, it is there. An EBS volume is a single virtual drive that has to be attached to a single EC2 instance. Instances created by SageMaker for SageMaker Notebooks are EC2 instances with EBS volumes.


EFS volumes

EFS is the networked drive version of EBS. With EFS you have the equivalent of multiple EBS drives networked together, so that data can be accessed by multiple compute instances. There is both a Standard EFS version and an infrequent access version, EFS IA. EFS IA allows you to save costs on storing data files that are used less often.

Secondary data repositories

Infographic identifying common data repositories used in AWS machine learning

The data from these repositories has to be loaded into a data repository that SageMaker can consume. Typically this is S3.


Amazon RDS

Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database. There are many database engines to choose from, both open source (MySQL, PostgreSQL) and vendor owned (Oracle, Microsoft). AWS takes care of most of the administration and maintenance, leaving users free to concentrate on using and getting benefit from the database.

Use cases:

  • Data that is relational and structured
  • Online Transaction Processing (OLTP)
  • Running relational joins and complex updates.


DynamoDB

DynamoDB is a NoSQL database where data is stored as key-value pairs. It treats all data within it as being composed of a list of attributes and values. The data can be both structured and unstructured. It is fast and can handle massive quantities of transactions. DynamoDB use cases are:

  • Non-relational data
  • Structured and less structured data
  • Storing JSON documents
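
The attribute/value model above can be illustrated with a DynamoDB item in the low-level API format, where every attribute carries a type tag. The table and attribute names here are made up for illustration.

```python
# A DynamoDB item in the low-level API format: each attribute is a
# {type: value} pair ("S" = string, "N" = number, "M" = map).
# Table and attribute names are illustrative only.
item = {
    "dataset_id": {"S": "images-2024-01"},   # partition key
    "record_count": {"N": "125000"},         # numbers are sent as strings
    "labels": {"M": {                        # nested, less-structured data
        "cat": {"N": "60000"},
        "dog": {"N": "65000"},
    }},
}

# To write it with boto3 (not run here):
# import boto3
# dynamodb = boto3.client("dynamodb")
# dynamodb.put_item(TableName="ml-datasets", Item=item)
```

Because each item is just a list of typed attributes, different items in the same table can carry different attributes, which is what makes DynamoDB suitable for less-structured data.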


Amazon Redshift

Amazon Redshift is a fast, massively scalable data warehouse system. It stores structured data that can be accessed and manipulated with standard SQL. The columnar underlying storage allows complex analytical queries against massive datasets. Redshift use cases are:

  • Data warehouse
  • Structured relational data
  • Complex analytical queries.

Useful commands:

  • The UNLOAD command is used to save a table to a set of files on S3.
  • The COPY command is used to take data from an S3 bucket and place it in a RedShift table.
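
The two commands can be sketched as SQL strings. The bucket path, table name, and IAM role ARN below are placeholders; the statements would be run through any Redshift SQL client.

```python
# The Redshift UNLOAD and COPY commands as SQL strings. The bucket path,
# table name, and IAM role ARN are placeholders.

# UNLOAD: save the result of a query to a set of files on S3.
unload_sql = """
UNLOAD ('SELECT * FROM training_data')
TO 's3://my-bucket/exports/training_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET;
"""

# COPY: load data from an S3 bucket into a Redshift table.
copy_sql = """
COPY training_data
FROM 's3://my-bucket/imports/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET;
"""
```

For Machine Learning, UNLOAD is the typical route for getting warehouse data into S3 where SageMaker can read it.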

Redshift Spectrum

Redshift Spectrum is a feature of Redshift that allows SQL access to data stored in S3. SQL can be used to access data in both Redshift and Redshift Spectrum within the same query. To be accessed, the data must be represented in a catalogue, for example a Glue Database or Athena Catalogue. Redshift Spectrum use cases are:

  • Data lake
  • Semi structured data


Amazon Timestream

Amazon Timestream is a serverless database for storing time series data, for example from IoT devices or log data. Data is stored and queried by time intervals, and access to the data is very fast. Timestream use cases are:

  • Use Timestream to identify trends, patterns and anomalies in time series data


DocumentDB

DocumentDB is a repository optimised for storing and querying JSON documents. It is MongoDB-compatible, hosted on AWS scalable infrastructure, and is marketed by AWS as a way to migrate existing MongoDB instances onto AWS serverless infrastructure.

  • Use to migrate MongoDB to AWS
  • Store JSON documents
  • Non-relational data
  • Less structured data

Data repositories compared

The data repositories have been compared in the following table by:

  1. Cost – particularly for large quantities of data
  2. Availability – how long does it take to make the data ready for processing
  3. Usability – can the preferred Machine Learning and pre-processing tools access the storage

This is a subjective comparison that may change dramatically depending on the application and size of the data.

Repository                        Use case                         Cost       Availability  Usability
S3 Standard and Lake Formation    General purpose                  Low        Fast          Excellent
S3 Intelligent-Tiering            Unknown or changing access       Lower      Fast          Excellent
S3 Standard-IA and One Zone-IA    Data infrequently accessed       Lower      Fast          Good
S3 Glacier                        Long term archive                Very low   Very slow     Needs retrieval to S3
FSx for Lustre with S3            High speed processing            High       Very fast     Excellent
EBS                               High speed processing            Medium     Very fast     Excellent
EFS                               High speed processing            Medium     Very fast     Excellent
RDS                               Relational database              Medium     Fast          Needs ETL to S3
DynamoDB                          NoSQL JSON database              Low        Fast          Needs ETL to S3
Redshift                          Data warehouse                   Low        Fast          Needs ETL to S3
Timestream                        Time series optimised database   ???        Fast          Needs ETL to S3
DocumentDB                        JSON document database           Low        Fast          Needs ETL to S3

Data structure types

Structured data is data that has a standard format for efficient access by software and humans alike. The data is typically tabular, with rows and columns that clearly define attributes. Examples of structured data are:

  • Excel files
  • Relational databases
  • Point of sale data
  • Web forms

Unstructured data is information with no set data model or data that has not yet been ordered in a predefined way. Examples of unstructured data:

  • Text files
  • Video files
  • Reports
  • Email
  • Images

Semi-structured data sits between structured and unstructured data. Semi-structured data has some attributes of both structured and unstructured data. Examples of semi-structured data are:

  • JSON
  • XML
  • Email
  • Zipped files
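
The difference between the structured and semi-structured categories above can be shown in a few lines: tabular rows share a fixed schema, while JSON documents may each carry different fields. The field names here are invented for illustration.

```python
import json

# Structured: every row has the same columns (a fixed schema).
structured_rows = [
    {"id": 1, "product": "widget", "price": 9.99},
    {"id": 2, "product": "gadget", "price": 19.99},
]

# Semi-structured: JSON documents share some fields but not all.
semi_structured = json.loads("""
[
  {"id": 1, "product": "widget", "tags": ["sale", "new"]},
  {"id": 2, "product": "gadget", "reviews": {"count": 12, "avg": 4.5}}
]
""")

# All structured rows expose identical keys; the JSON documents do not.
schema_diff = set(semi_structured[0]) ^ set(semi_structured[1])
# schema_diff holds the fields that appear in only one of the two documents
```

This is why semi-structured repositories such as DynamoDB and DocumentDB do not require a schema up front, while relational stores such as RDS do.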

Repository data structures compared

The data repositories have been compared in the following table by:

  1. Data structure – the type of structure of the data
  2. File type – the file types that can be stored
  3. Data type – the format the data must be in

Repository          Data structure            File type         Data type
FSx for Lustre      All                       Appear as files   All
DynamoDB            Key value pairs           –                 JSON docs
DocumentDB          Key value pairs           –                 JSON docs
Redshift Spectrum   Semistructured            Any               –
Timestream          Structured (schemaless)   –                 Time series data


S3 is king for Machine Learning because the data is in a form that SageMaker can ingest. However, there are many repositories that may hold the raw data prior to extraction and processing to make it ready for Machine Learning. There are also some newer data repositories that can be accessed during training to improve performance. This article has briefly described these repositories, their attributes, and their use cases relevant to Machine Learning.


Contains affiliate links. If you go to Whizlab’s website and make a purchase I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.

Whizlabs AWS Certified Machine Learning Specialty

Practice Exams with 271 questions, Video Lectures and Hands-on Labs from Whizlabs

Whizlab’s AWS Certified Machine Learning Specialty Practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined by official documentation. These practice tests help candidates gain more confidence in exam preparation and self-evaluate themselves against the exam content.

Practice test content

  • Free Practice test – 15 questions
  • Practice test 1 – 65 questions
  • Practice test 2 – 65 questions
  • Practice test 3 – 65 questions
Whizlabs AWS certified machine learning course with a robot hand

Section test content

  • Core ML Concepts – 10 questions
  • Data Engineering – 11 questions
  • Exploratory Data Analysis – 13 questions
  • Modeling – 15 questions
  • Machine Learning Implementation and Operations – 12 questions

Questions and answers

Created by Michael Stainsbury

1.1 Machine Learning data repositories (Silver)

This test has 10 questions that test the study guide for sub-domain 1.1, Create data repositories for machine learning, of the Data Engineering knowledge domain.

  1. What do DynamoDB attributes look like?
  2. Which of these databases can RDS provision?
  3. Which data store can you migrate MongoDB to?
  4. What data store is suitable for structured data?
  5. What does Redshift Spectrum enable?
  6. What is the maximum amount of data an S3 bucket can hold?
  7. What data store is suitable for unstructured and semi-structured data?
  8. What type of databases does RDS provide?
  10. What type of data does Amazon Timestream store?



Whizlab’s AWS Certified Machine Learning Specialty course

  • In Whizlabs AWS Machine Learning certification course, you will learn and master how to build, train, tune, and deploy Machine Learning (ML) models on the AWS platform.
  • Whizlab’s Certified AWS Machine Learning Specialty practice tests offer you a total of 200+ unique questions to get a complete idea about the real AWS Machine Learning exam.
  • Also, you get access to hands-on labs in this course. There are about 10 lab sessions that are designed to take your practical skills on AWS Machine Learning to the next level.

Course content

The course has 3 resources which can be purchased separately, or together:

  • 9 Practice tests with 271 questions
  • Video course with 65 videos
  • 9 hands on labs
