Obtaining large, specialised datasets is essential for experimenting with and training Machine Learning models so they can recognise patterns in real-world data and make predictions. Datasets can also serve as a source of labelled data for training models that generalise to unlabelled real-world data. Fortunately, there are many dataset sources that are open and free to use. These datasets are stored in libraries with searchable metadata, making it possible to find the ideal dataset for the problem being solved.
Creating data repositories for machine learning is task statement 1.1 of the Data Engineering knowledge domain. For more information about the exam structure, see the AWS Machine Learning – Specialty exam guide:
Task Statement 1.1: Create data repositories for ML. (AWS Machine Learning – Specialty exam guide)
– Identify data sources (for example, content and location, primary sources such as user data).
– Determine storage mediums (for example, databases, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon Elastic Block Store [Amazon EBS]).
Metadata is data about data; in this case it allows datasets in libraries to be searched and identified. Each library has its own way of classifying and describing its datasets, but these are typical fields (a sketch of searching on them follows the list):
- Description – a textual summary of the general contents of the dataset
- Format – for example, JSON or CSV
- Content – a list of the fields in the dataset
- Location – where and how to retrieve the dataset
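As an illustration, a catalogue entry can be modelled as a small record and filtered on those fields. The sketch below is hypothetical: the field names, values, and locations do not come from any particular library's schema.

```python
# Hypothetical catalogue entries; real libraries expose similar fields
# (description, format, content, location) through their search interfaces.
catalog = [
    {
        "description": "Daily weather observations from global stations",
        "format": "CSV",
        "content": ["station_id", "date", "temperature", "precipitation"],
        "location": "s3://example-bucket/weather/",
    },
    {
        "description": "Movie reviews labelled by sentiment",
        "format": "JSON",
        "content": ["review_text", "sentiment"],
        "location": "https://example.org/datasets/reviews.json",
    },
]

def find_datasets(catalog, keyword, fmt=None):
    """Yield entries whose description matches a keyword, optionally by format."""
    for entry in catalog:
        if keyword.lower() in entry["description"].lower():
            if fmt is None or entry["format"] == fmt:
                yield entry

for match in find_datasets(catalog, "weather", fmt="CSV"):
    print(match["location"])
```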
Libraries of datasets are maintained by all the major cloud vendors and by some independent sources.
AWS hosts datasets in the AWS Marketplace.
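Many of the open datasets hosted on AWS are delivered as public Amazon S3 buckets. As a minimal sketch, assuming the publicly readable noaa-ghcn-pds bucket (NOAA climate records from the Registry of Open Data on AWS), the contents can be browsed anonymously with boto3:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: public open-data buckets do not require AWS credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the first few objects to inspect what the dataset contains.
response = s3.list_objects_v2(Bucket="noaa-ghcn-pds", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```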
Google hosts datasets through the Google Cloud Public Datasets program.
Kaggle describes its dataset library as a place to “share, stress test, and stay up-to-date on all the latest ML techniques and technologies” and to “discover a huge repository of community-published models, data & code for your next project.”
Hugging Face describes itself as “the platform where the machine learning community collaborates on models, datasets, and applications.”
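Independent hubs typically provide client libraries for pulling a dataset directly into code. For example, with the Hugging Face datasets package installed, a dataset can be downloaded and inspected in a few lines (the imdb dataset name here is just a common example, not one singled out by this article):

```python
from datasets import load_dataset

# Download the dataset (and its metadata) from the Hugging Face Hub.
dataset = load_dataset("imdb", split="train")

print(dataset)                   # features and number of rows
print(dataset[0]["text"][:200])  # peek at the first labelled example
```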