Building a Data Lake on AWS with S3, Glue, and Athena

Author: Rudi Suryadi

Architecture Diagram


Prerequisites:

Part 1 : Ingest and Storage

Download Sample Data from GitHub Archive

Create S3 Bucket

In this step we will navigate to the S3 Console and create the S3 bucket used throughout this demo.

Log in to the AWS Console: https://console.aws.amazon.com/console/home?region=ap-southeast-1

Navigate to the S3 Console and create a new bucket in the ap-southeast-1 region:
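If you prefer the command line, the same bucket can be created with the AWS CLI. This is a sketch, not part of the original lab; the bucket name `innovate-datalake-123456789012` is a placeholder (bucket names must be globally unique, so a common convention is to append your account ID). No `<test>` is included because these commands require live AWS credentials.

```shell
# Create the lab bucket in the ap-southeast-1 region.
# Replace "innovate-datalake-123456789012" with your own unique bucket name.
aws s3 mb s3://innovate-datalake-123456789012 --region ap-southeast-1

# Confirm the bucket exists.
aws s3 ls | grep innovate-datalake
```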

Upload Sample Data to S3 Bucket

In this step we will navigate to the S3 Console and upload the sample data used in this lab.

You should have a folder structure similar to this:

By now, your S3 bucket should look like this:
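The upload can also be scripted with the AWS CLI. The local folder name `gharchive-sample/` and the `raw/` prefix below are assumptions for illustration; use whatever structure you downloaded in the previous step. These commands need AWS credentials, so no `<test>` is attached.

```shell
# Recursively upload the downloaded sample files into a "raw" prefix.
# "gharchive-sample/" is a placeholder for your local download folder.
aws s3 cp ./gharchive-sample/ \
  s3://innovate-datalake-123456789012/raw/ --recursive

# Verify the objects landed where the crawler will look for them.
aws s3 ls s3://innovate-datalake-123456789012/raw/ --recursive
```

Keeping the raw JSON under a single prefix matters later: the Glue crawler is pointed at one S3 path, and everything under it is treated as one dataset.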

Part 2 : Catalog and Transform

Create IAM Role

In this step we will navigate to the IAM Console and create a new Glue service role. This allows AWS Glue to access the data sitting in S3 and create the necessary entities in the Glue Data Catalog.
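As an alternative to clicking through the IAM Console, the role can be created from the CLI. The role name `innovate-glue-role` is a placeholder; the two managed policies (`AWSGlueServiceRole` for catalog access and `AmazonS3ReadOnlyAccess` for reading the raw data) are real AWS-managed policies, though a scoped-down inline S3 policy would be tighter for production. Requires AWS credentials, so no `<test>` is attached.

```shell
# Trust policy letting the Glue service assume this role.
cat > glue-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "glue.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# Create the role (name is a placeholder).
aws iam create-role --role-name innovate-glue-role \
  --assume-role-policy-document file://glue-trust-policy.json

# Attach the Glue service policy and S3 read access.
aws iam attach-role-policy --role-name innovate-glue-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy --role-name innovate-glue-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```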

Create AWS Glue Crawlers

In this step, we will navigate to the AWS Glue Console and create a Glue crawler to discover the newly ingested data in S3.

You should see a message : Crawler innovate-crawler was created to run on demand.
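The equivalent CLI calls are sketched below. The crawler name `innovate-crawler` and database name `innovate-db` come from this lab; the role name and S3 path match the placeholders used in earlier steps and should be adjusted to yours. These commands need AWS credentials, so no `<test>` is attached.

```shell
# Create an on-demand crawler over the raw JSON prefix,
# writing discovered tables into the "innovate-db" database.
aws glue create-crawler \
  --name innovate-crawler \
  --role innovate-glue-role \
  --database-name innovate-db \
  --targets '{"S3Targets":[{"Path":"s3://innovate-datalake-123456789012/raw/"}]}'

# Kick it off manually (it was created to run on demand).
aws glue start-crawler --name innovate-crawler

# Poll its state until it returns to READY.
aws glue get-crawler --name innovate-crawler --query 'Crawler.State'
```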

Wait a few minutes for the crawler to finish.

Verify the newly created tables in the catalog

Navigate to the Glue Data Catalog and explore the crawled data:

Query newly ingested data using Amazon Athena

```sql
SELECT id, created_at
FROM "innovate-db"."data"
WHERE type = 'PullRequestEvent'
```
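The same query can be submitted without opening the Athena Console, using the Athena API via the CLI. The `athena-results/` output prefix is an assumption; Athena requires some S3 location for query results. Needs AWS credentials, so no `<test>` is attached.

```shell
# Submit the query; this returns a QueryExecutionId immediately.
aws athena start-query-execution \
  --query-string "SELECT id, created_at FROM \"innovate-db\".\"data\" WHERE type = 'PullRequestEvent'" \
  --query-execution-context Database=innovate-db \
  --result-configuration OutputLocation=s3://innovate-datalake-123456789012/athena-results/

# Fetch the results once the execution succeeds (substitute the returned ID).
# aws athena get-query-results --query-execution-id <QueryExecutionId>
```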

Transform data - write your ETL job

In this step you will convert the JSON files to Parquet, a columnar format that Athena can scan far more efficiently than raw JSON.

Navigate to the Glue Console and transform your data:
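A minimal Glue ETL script for this conversion might look like the sketch below. It is an assumption-laden outline, not the lab's exact job: the database (`innovate-db`), table (`data`), and output path are the placeholders used throughout this guide, and the script only runs inside an AWS Glue job (the `awsglue` modules are not installable locally), so no `<test>` is attached.

```python
# Sketch of a Glue ETL job: read the crawled JSON table from the
# Glue Data Catalog and write it back to S3 as Parquet.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created (names are lab placeholders).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="innovate-db", table_name="data"
)

# Write Parquet to a separate prefix so raw and transformed data stay apart.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://innovate-datalake-123456789012/parquet/"},
    format="parquet",
)
job.commit()
```

After the job finishes, point a second crawler (or add a table manually) at the `parquet/` prefix and re-run the Athena query against it to compare scanned bytes and latency.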