
Identify data sources

Large, specialised datasets are essential for experimenting with and training machine learning models, so that the models can recognise patterns in real-world data and make predictions. Labelled datasets can also be used to train models that generalise to unlabelled real-world data. Fortunately, there are many open, free-to-use sources of datasets. These datasets are stored in libraries, with searchable metadata that helps you find the right dataset for the problem being solved.

"Create data repositories for ML" is Task Statement 1.1 of the Data Engineering knowledge domain. For more information about the exam structure, see the AWS Machine Learning exam guide.

Task Statement 1.1: Create data repositories for ML.
– Identify data sources (for example, content and location, primary sources such as user data).
– Determine storage mediums (for example, databases, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon Elastic Block Store [Amazon EBS]).

AWS Machine Learning – Specialty exam guide

Metadata

Metadata is data about data. Here, it allows the datasets in a library to be searched and identified. Each library has its own way of classifying and describing its datasets. These are typical fields:

  • Title
  • Description – a word description of the general contents of the dataset
  • Format – for example JSON, CSV
  • Size
  • Content – a list of the fields in the dataset
  • Location, or how to retrieve the dataset

Example libraries

Libraries of datasets are maintained by all the major cloud vendors and by some independent sources.

AWS

AWS hosts datasets in the AWS Marketplace.

Google Cloud Public Datasets

Microsoft

Kaggle

Kaggle describes its dataset library as a place to "share, stress test, and stay up-to-date on all the latest ML techniques and technologies. Discover a huge repository of community-published models, data & code for your next project."

Hugging Face

Hugging Face describes itself as the platform where the machine learning community collaborates on models, datasets, and applications.

Credits

Photo by Eric Prouzet on Unsplash
