Machine Learning data repositories compared
Data repositories, or storage, are where data files are stored before, during and after processing. To realise the full potential of Machine Learning, appropriate storage solutions need to be available. The characteristics of storage most relevant to Machine Learning are:
- Cost – particularly for large quantities of data
- Availability – how long does it take to make the data ready for processing
- Usability – can the preferred Machine Learning and pre-processing tools access the storage, and how fast is that access
A comparison table is at the end of this Study Guide.
“Create data repositories for machine learning” is subdomain 1.1 of the Data Engineering knowledge domain. For more information about the exam structure see: AWS Machine Learning exam syllabus
Task Statement 1.1: Create data repositories for ML.
AWS Machine Learning – Specialty exam guide
– Identify data sources (for example, content and location, primary sources such as user data).
– Determine storage mediums (for example, databases, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon Elastic Block Store [Amazon EBS]).
Questions
Scroll to the bottom of the page for questions and answers.
Curated video
There are two types of data repositories for Machine Learning: those that interface directly with SageMaker, and those that store data that has to be transferred to another datastore, usually S3, before it can be used. Originally SageMaker could only receive data from S3 for both training and production. Now there are more options to improve the performance of training.
Repositories for SageMaker
In 2019 AWS announced that SageMaker could accept data directly from Amazon EFS and Amazon FSx for Lustre as well as S3. The driver for providing access to new data repositories is to reduce the time spent waiting for S3 objects to finish loading. Block- and file-based data can be processed as it is received.

S3
S3 is an object-based data repository. This means that files are stored as single objects identified by a key. It is massively scalable because AWS maintains vast farms of storage servers and provisions more when usage thresholds are exceeded. S3 is highly available and durable, and the costs are very low.
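As a minimal sketch of how a dataset file becomes an S3 object, the snippet below uses the boto3 SDK; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local training file; the key uniquely identifies the object in the bucket.
s3.upload_file("train.csv", "my-ml-bucket", "datasets/churn/train.csv")

# Download the object again when it is needed for processing.
s3.download_file("my-ml-bucket", "datasets/churn/train.csv", "train_local.csv")
```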
S3 life cycle configuration
How data is used and its storage requirements can change over time. New data may need to be used immediately from low latency storage. As data ages it may become less important and can be moved to cheaper, higher latency storage and eventually archived or deleted. The process that moves data between different tiers of storage is life cycle management.
S3 life cycle management comprises two types of action:
- Transition
- Expiration
Transition is the process of moving datasets through storage classes with different characteristics. Typically this is used to move datasets from highly available storage, ready for immediate processing, to cheaper, reduced availability storage as the data gets older and is less likely to be needed for immediate processing. For example: datasets are stored in S3 for 90 days and are then moved to S3 Glacier.
Expiration is the process by which data is automatically deleted after a certain period of time. For example: all datasets older than 450 days must be deleted. This is important for regulatory requirements that may allow some data to be retained only for a limited period.
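A small sketch of both action types, applying the 90-day transition and 450-day expiration examples above with boto3 (the bucket name and prefix are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "dataset-lifecycle",
                "Filter": {"Prefix": "datasets/"},
                "Status": "Enabled",
                # Transition: move datasets to S3 Glacier after 90 days.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # Expiration: delete datasets after 450 days.
                "Expiration": {"Days": 450},
            }
        ]
    },
)
```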

S3 Data storage options
AWS S3 has multiple storage options to satisfy the needs of Machine Learning.
General purpose
This is Standard S3. S3 is an object store that has high durability and availability built in. The datasets are instantly accessible. This storage option is for regularly accessed data that must be immediately available.
Unknown or changing access
S3 Intelligent-Tiering is for data that is accessed in an unpredictable way. Data in the repository is automatically moved between an instant access tier and longer term storage depending on when it was last accessed.
Infrequent access
S3 Standard-IA (Infrequent Access) is for data that is not accessed frequently. S3 One Zone-IA is for low value or easily recreated data.
Archive
S3 Glacier and S3 Glacier Deep Archive are used to provide long term, low cost archiving of data.
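The storage class can be chosen per object at upload time. A short sketch (the bucket, key and file are hypothetical; other valid classes include INTELLIGENT_TIERING, ONEZONE_IA, GLACIER and DEEP_ARCHIVE):

```python
import boto3

s3 = boto3.client("s3")

# Put an older, rarely used dataset straight into S3 Standard-IA.
with open("2020-features.parquet", "rb") as data:
    s3.put_object(
        Bucket="my-ml-bucket",
        Key="datasets/archive/2020-features.parquet",
        Body=data,
        StorageClass="STANDARD_IA",
    )
```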
AWS Lake Formation
AWS Lake Formation is a service from Amazon that rapidly sets up a data lake with S3 as the data repository. A data lake is a data repository that stores structured and unstructured data at any scale. Data lakes allow data to be centralised and made available for analysis before the purpose of the analysis is defined. Lake Formation deploys a data lake as a data repository that includes built-in:
- Security
- ETL (Extract Transform Load)
- Formatting, for example into Parquet or ORC
- ML to improve data quality
Lake Formation is built on top of AWS Glue, leveraging all the features of that service. During setup you are taken through a series of options that guide Lake Formation to:
- Find the input data sources
- Set up the S3 data lake
- Move the data to the S3 data lake
- Crawl the data to determine its structure and build a data catalogue
- Perform ETL
- Set up security to protect the data.
Once these tasks are complete the data is ready for Machine Learning processing.
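The console wizard drives these steps, but the same building blocks are scriptable. A rough sketch, assuming a hypothetical data lake bucket, a Glue IAM role and a target catalogue database:

```python
import boto3

lakeformation = boto3.client("lakeformation")
glue = boto3.client("glue")

# Register the S3 location as part of the data lake so Lake Formation can manage access.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",
    UseServiceLinkedRole=True,
)

# Crawl the raw data to determine its structure and populate the data catalogue.
glue.create_crawler(
    Name="data-lake-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="data_lake_catalogue",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},
)
glue.start_crawler(Name="data-lake-crawler")
```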
Amazon FSx for Lustre
FSx for Lustre is a high performance combination of S3 and SSD storage. Data is presented as files to the Machine Learning models. This enables processing to start immediately without the wait needed for S3 objects to be fully loaded. Lustre (lustre.org) is an open source parallel file system supporting High Performance Computing, originally from a research project at Carnegie Mellon University. FSx for Lustre is Amazon’s way of supercharging its storage, including S3. The features of FSx for Lustre are:
- high performance storage system
- low latency
- high throughput
- high IOPS
- multiple underlying storage types
FSx for Lustre can be linked to S3 for concurrent access to both S3 and the Lustre high performance file system. S3 objects appear as files.
The Machine Learning use case for FSx for Lustre is serving massive training datasets to SageMaker. The concurrent file store allows multiple compute instances to share and work on the data at the same time. FSx for Lustre integrates with SageMaker. The data is lazy loaded, eliminating the time cost of the initial download from S3. This also reduces the cost of accessing common objects for iterative jobs on the same datasets.
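A sketch of how a training job can read from FSx for Lustre instead of S3, using the SageMaker Python SDK's FileSystemInput; the file system id, directory path, VPC details, image and role are placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

# Training data served directly from an FSx for Lustre file system linked to S3.
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder file system id
    file_system_type="FSxLustre",
    directory_path="/fsx/training-data",     # placeholder mount name and path
    file_system_access_mode="ro",
)

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<sagemaker-execution-role-arn>",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    # The training instances must run in a VPC that can reach the file system.
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)

estimator.fit({"train": train_input})
```

The same FileSystemInput class also accepts file_system_type="EFS" for training data held on EFS.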
Video – Amazon FSx for Lustre, Persistent Storage Overview
This AWS video by Darryl Osborne introduces Amazon FSx for Lustre. The video is 8:49 long; below are the timestamps for the subjects covered:
- 0:00 – Amazon FSx for Lustre
- 1:11 – Amazon FSx for Lustre development options
- 1:37 – Amazon FSx for Lustre demo environment
- 4:44 – Amazon FSx for Lustre write test
- 5:40 – Amazon FSx for Lustre read test
- 6:40 – Amazon FSx for Lustre in-memory cache test
- 8:06 – GitHub scripts
- 8:49 – End
EBS volumes
Elastic Block Store (EBS) volumes are the virtual version of a PC’s hard drive. Data is stored as files and fast access can be specified. The data can be backed up via snapshots for durability, and it is also possible to set up RAID configurations. Provisioned IOPS (guaranteed input/output performance) can be expensive, but if you need it, it is there. An EBS volume is a single virtual drive that has to be attached to a single EC2 instance. Instances created by SageMaker for SageMaker notebooks are EC2 instances with EBS volumes.
EFS
EFS is the networked file system counterpart of EBS. Whereas an EBS volume attaches to a single instance, an EFS file system can be mounted by multiple compute instances at the same time. There is both a Standard EFS storage class and an infrequent access class, EFS IA. EFS IA allows you to save costs on stored data files that are used less often.
Secondary data repositories

The data from these repositories has to be loaded into a data repository that SageMaker can consume. Typically this would be S3.
RDS
Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database. There are many database engines to choose from, both open source (MySQL, PostgreSQL) and vendor owned (Oracle, Microsoft SQL Server). AWS takes care of most of the administration and maintenance, leaving users free to concentrate on using and getting benefit from the database. A short sketch of staging RDS data in S3 follows the use cases below.
Use cases:
- Data that is relational and structured
- Data warehouse
- On Line Transaction Processing
- Running relational joins and complex updates.
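Because SageMaker does not read from RDS directly, the usual pattern is to extract a query result and stage it in S3. A minimal sketch, assuming a hypothetical PostgreSQL instance, table and bucket:

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string for an RDS PostgreSQL instance.
engine = create_engine("postgresql+psycopg2://user:password@my-rds-endpoint:5432/sales")

# Extract the relational data, then stage it in S3 where SageMaker can ingest it.
df = pd.read_sql("SELECT customer_id, tenure, spend, churned FROM customers", engine)
df.to_csv("customers.csv", index=False)

boto3.client("s3").upload_file("customers.csv", "my-ml-bucket", "staging/customers.csv")
```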
DynamoDB
DynamoDB is a NoSQL database where data is stored as key value pairs. It treats all data within it as a list of attributes and values. The data can be both structured and unstructured. It is fast and can handle massive quantities of transactions; a short sketch of the key value model follows the list below. DynamoDB use cases are:
- Non-relational data
- Structured and less structured data
- Storing JSON documents
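A small sketch of the key value model with boto3, using a hypothetical table whose partition key is user_id:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserEvents")  # hypothetical table

# Each item is simply a collection of attribute name/value pairs.
table.put_item(
    Item={"user_id": "u-1001", "event": "click", "payload": {"page": "home", "ms": 127}}
)

# Fast lookup by key.
response = table.get_item(Key={"user_id": "u-1001"})
print(response.get("Item"))
```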
RedShift
Amazon Redshift is a fast, massively scalable data warehouse system. It stores structured data that can be accessed and manipulated with standard SQL. The underlying columnar storage allows complex analytical queries against massive datasets. Redshift use cases are:
- Data warehouse
- Structured relational data
- Complex analytical queries.
Useful commands:
- The UNLOAD command is used to save a table to a set of files on S3.
- The COPY command is used to take data from an S3 bucket and place it in a Redshift table (both commands are sketched below).
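Both commands are plain SQL, so they can be issued from Python with the Redshift Data API. A sketch using a hypothetical cluster, table, bucket and IAM role:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY: load files from S3 into a Redshift table.
redshift_data.execute_statement(
    ClusterIdentifier="ml-warehouse",
    Database="analytics",
    DbUser="awsuser",
    Sql="""
        COPY sales FROM 's3://my-ml-bucket/staging/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
        FORMAT AS CSV;
    """,
)

# UNLOAD: save a query result to a set of files on S3, ready for SageMaker.
redshift_data.execute_statement(
    ClusterIdentifier="ml-warehouse",
    Database="analytics",
    DbUser="awsuser",
    Sql="""
        UNLOAD ('SELECT * FROM sales WHERE region = ''EU''')
        TO 's3://my-ml-bucket/exports/sales_eu_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
        FORMAT AS PARQUET;
    """,
)
```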
Redshift Spectrum
Redshift Spectrum is a feature of Redshift that allows SQL access to data stored in S3. SQL can be used to access data in both Redshift and Redshift Spectrum within the same query. To access the data it must be represented in a catalogue, for example a Glue database or Athena catalogue. Redshift Spectrum use cases are:
- Data lake
- Semi structured data
Timestream
Amazon Timestream is a serverless database for storing time series data, for example from IoT devices or log data. Data is stored and queried by time intervals. Access to data is very fast. Timestream use cases are:
- Use Timestream to identify trends, patterns and anomalies in time series data
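A sketch of querying by time interval with boto3; the database, table and measure names are hypothetical:

```python
import boto3

timestream = boto3.client("timestream-query")

# Average temperature per device over 5-minute bins for the last hour.
result = timestream.query(
    QueryString="""
        SELECT device_id,
               BIN(time, 5m) AS binned_time,
               AVG(measure_value::double) AS avg_temperature
        FROM "iot_db"."sensor_readings"
        WHERE measure_name = 'temperature' AND time > ago(1h)
        GROUP BY device_id, BIN(time, 5m)
    """
)

for row in result["Rows"]:
    print(row["Data"])
```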
DocumentDB
DocumentDB is a repository optimised for storing and querying JSON documents. It is MongoDB compatible, hosted on AWS scalable infrastructure, and is marketed by AWS as a way to migrate existing MongoDB instances onto AWS managed infrastructure; a short connection sketch follows the list below.
- Use to migrate MongoDB to AWS
- Store JSON documents
- Non-relational data
- Less structured data
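Because DocumentDB is MongoDB compatible, a standard MongoDB driver works against it. A sketch with pymongo; the cluster endpoint and credentials are placeholders, and DocumentDB connections require TLS with the Amazon CA bundle:

```python
from pymongo import MongoClient

# Placeholder DocumentDB cluster endpoint and credentials.
client = MongoClient(
    "mongodb://user:password@my-docdb-cluster.cluster-xxxx.eu-west-1.docdb.amazonaws.com:27017",
    tls=True,
    tlsCAFile="global-bundle.pem",  # Amazon CA bundle, downloaded separately
)

collection = client["catalogue"]["products"]

# Store and query JSON documents exactly as with MongoDB.
collection.insert_one({"sku": "A-100", "name": "widget", "tags": ["metal", "blue"]})
print(collection.find_one({"sku": "A-100"}))
```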
Data repositories compared
The data repositories have been compared in the following table by:
- Cost – particularly for large quantities of data
- Availability – how long does it take to make the data ready for processing
- Usability – can the preferred Machine Learning and pre-processing tools access the storage
This is a subjective comparison that may change dramatically depending on the application and size of the data.
Repository | Uses | Cost | Availability | Usability |
---|---|---|---|---|
S3 Standard and Lake Formation | General purpose | Low | Fast | Excellent |
S3 Intelligent-Tiering | Unknown or changing access | Lower | Fast | Excellent |
S3 Standard-IA and One Zone-IA | Data infrequently accessed | Lower | Fast | Good |
S3 Glacier | Long term archive | Very low | Very slow | Needs retrieval to S3 |
FSx for Lustre with S3 | High speed processing | High | Very fast | Excellent |
EBS | High speed processing | Medium | Very fast | Excellent |
EFS | High speed processing | Medium | Very fast | Excellent |
RDS | Relational database | Medium | Fast | Needs ELT to S3 |
DynamoDB | NoSQL JSON database | Low | Fast | Needs ELT to S3 |
Redshift | Data warehouse | Low | Fast | Needs ELT to S3 |
Timestream | Time series optimised database | ??? | Fast | Needs ELT to S3 |
DocumentDB | JSON document database | Low | Fast | Needs ELT to S3 |
- Picking the Right Data Store for Your Workload
- The fast data availability of EBS and EFS is because the data can be used directly without being moved: see AWS model access training data
Data structure types
Structured data is data that has a standard format for efficient access by software and humans alike. The data is typically tabular, with rows and columns that clearly define attributes. Examples of structured data are:
- Excel files
- Relational databases
- Point of sale data
- Web forms
Unstructured data is information with no set data model or data that has not yet been ordered in a predefined way. Examples of unstructured data:
- Text files
- Video files
- Reports
- Images
Semi-structured data sits between structured and unstructured data. Semi-structured data has some attributes of both structured and unstructured data. Examples of semi-structured data are:
- JSON
- XML
- Zipped files
Repository data structures compared
The data repositories have been compared in the following table by:
- Data structure – the type of structure of the data
- File type – the type of file that can be stored
- Data type – the data type the data must be formatted as
Repository | Data structure | File type | Data type |
---|---|---|---|
S3 | All | Objects | All |
FSx for Lustre | All | Appear as files | All |
EBS | All | Files | All |
EFS | All | Files | All |
RDS | Structured | All | |
DynamoDB | Structured, Semi-structured | Key value pairs, JSON docs | |
DocumentDB | Structured, Semi-structured | Key value pairs, JSON docs | |
Redshift | Structured | Any | |
Redshift Spectrum | Semi-structured | Any, Parquet | |
Timestream | Structured (schemaless) | Time series data | |
Lake Formation | Structured, Unstructured | Parquet | |
Summary
S3 is king for Machine Learning because the data is in a form that SageMaker can ingest. However there are many repositories that may hold the raw data prior to extraction and processing to make it ready for Machine Learning. There are also some newer data repositories that can be accessed during training to improve performance. This article has briefly described these repositories, their attributes and their use cases relevant to Machine Learning.
Credits
- Photo by Tim Evans on Unsplash
- AWS icons: Downloaded from https://aws.amazon.com/architecture/icons/
- S3 Life Cycle icons:
- pesticide by DompbelStudio from the Noun Project
- Butterfly by Butterfly from the Noun Project
- Caterpillar by Juraj Sedlák from the Noun Project
- chrysalis by parkjisun from the Noun Project
Contains affiliate links. If you go to Whizlab’s website and make a purchase I may receive a small payment. The purchase price to you will be unchanged. Thank you for your support.
Whizlabs AWS Certified Machine Learning Specialty
Practice Exams with 271 questions, Video Lectures and Hands-on Labs from Whizlabs
Whizlab’s AWS Certified Machine Learning Specialty practice tests are designed by experts to simulate the real exam scenario. The questions are based on the exam syllabus outlined in the official documentation. These practice tests help candidates gain more confidence in exam preparation and evaluate themselves against the exam content.
Practice test content
- Free Practice test – 15 questions
- Practice test 1 – 65 questions
- Practice test 2 – 65 questions
- Practice test 3 – 65 questions
Questions and answers
Whizlab’s AWS Certified Machine Learning Specialty course
- In Whizlabs AWS Machine Learning certification course, you will learn and master how to build, train, tune, and deploy Machine Learning (ML) models on the AWS platform.
- Whizlab’s Certified AWS Machine Learning Specialty practice tests offer you a total of 200+ unique questions to get a complete idea about the real AWS Machine Learning exam.
- Also, you get access to hands-on labs in this course. There are about 10 lab sessions that are designed to take your practical skills on AWS Machine Learning to the next level.

Course content
The course has 3 resources which can be purchased separately, or together:
- 9 Practice tests with 271 questions
- Video course with 65 videos
- 9 hands on labs