price_paid_data_pipeline

This repo is a simple data pipeline built with PySpark. It takes a CSV file, constructs a new identifier property_id, groups transactions by this new identifier, and then outputs a newline-delimited JSON file.

Data Source: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

Data Definition Information: https://www.gov.uk/guidance/about-the-price-paid-data#download-option

The input dataset used by this repo is the Part 1 yearly file for 2021 (see file pp-2021-part1.csv).

The output file part-00000-3bbeb519-143c-47a9-8aa1-46e6407fd6dc-c000.json can be found under spark-test-output folder.
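For reference, below is a minimal sketch of the transformation. The column names follow the gov.uk data definition, and the identifier is assumed to be a hash of the address fields; the actual script may construct property_id differently.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("price_paid_data_pipeline").getOrCreate()

# Price Paid Data CSVs have no header row; the column names below follow the
# gov.uk data definition and are assumed rather than taken from the script.
columns = [
    "transaction_id", "price", "date_of_transfer", "postcode", "property_type",
    "old_new", "duration", "paon", "saon", "street", "locality", "town_city",
    "district", "county", "ppd_category_type", "record_status",
]
df = spark.read.csv("pp-2021-part1.csv", header=False).toDF(*columns)

# Construct property_id (assumption: a hash of the address fields).
df = df.withColumn(
    "property_id",
    F.sha2(F.concat_ws("|", "postcode", "paon", "saon", "street"), 256),
)

# Group all transactions for the same property into one record.
grouped = df.groupBy("property_id").agg(
    F.collect_list(
        F.struct("transaction_id", "price", "date_of_transfer")
    ).alias("transactions")
)

# Spark writes JSON as one object per line, i.e. newline-delimited JSON.
grouped.write.mode("overwrite").json("spark-test-output")
```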

Prerequisites

To run the PySpark script locally:

Spark 2.3.3
Hadoop 2.7.1
Python 3.6

(Note that newer versions of Spark are not very stable with Python 3.7+; the versions above work relatively reliably.)

Please note that if you run the script in PyCharm, do the following:

In Project Structure -> Add Content Root -> add the path of the Spark Python executable -> restart PyCharm.
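Alternatively, if you prefer not to change the PyCharm project structure, the findspark package can make Spark's Python libraries importable at runtime. This is just an option, not part of this repo, and the install path shown is hypothetical:

```python
# Alternative to the PyCharm content-root setup: findspark adds Spark's
# Python libraries to sys.path at runtime.
import findspark

findspark.init("/opt/spark-2.3.3-bin-hadoop2.7")  # hypothetical Spark install path

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
```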

To run the Jupyter notebook script (Python 3.8):

pandas==1.2.4
numpy==1.20.1
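For context, a rough pandas equivalent of the same transformation might look like the sketch below. The column names and identifier construction are assumptions; the actual notebook may differ.

```python
import pandas as pd

# Assumed column names, following the gov.uk data definition.
cols = [
    "transaction_id", "price", "date_of_transfer", "postcode", "property_type",
    "old_new", "duration", "paon", "saon", "street", "locality", "town_city",
    "district", "county", "ppd_category_type", "record_status",
]
df = pd.read_csv("pp-2021-part1.csv", header=None, names=cols)

# Construct property_id from the address fields (assumption).
df["property_id"] = (
    df["postcode"].astype(str) + "|" + df["paon"].astype(str)
    + "|" + df["saon"].astype(str) + "|" + df["street"].astype(str)
)

# One JSON record per property, written as newline-delimited JSON.
records = (
    df.groupby("property_id")
    .apply(lambda g: g[["transaction_id", "price", "date_of_transfer"]].to_dict("records"))
    .rename("transactions")
    .reset_index()
)
records.to_json("notebook-output.json", orient="records", lines=True)
```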

Deployment

I have written an AWS Glue driver script (see glue_driver_spark.py) and deployed it to AWS Glue (the deployment to AWS was done manually due to time constraints).

Glue provides a managed ETL service that runs in a serverless Apache Spark environment, so it is ideal for our case: it allows us to focus on the ETL job rather than configuring and managing the underlying compute resources.

I have manually created a Glue job with least-privilege policies from the AWS console; the Glue script, input files, and output files are all stored in S3 buckets. The test execution looks good, and the newline-delimited JSON outputs are uploaded to the S3 bucket as expected.
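For illustration, a Glue driver along these lines typically follows the pattern below. The argument names and the transformation placeholder are assumptions, not necessarily what glue_driver_spark.py does:

```python
import sys

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Job parameters passed in from the Glue job configuration (names assumed).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Same transformation as the local script: read the CSV from S3, build
# property_id, group, and write newline-delimited JSON back to S3.
df = spark.read.csv(args["input_path"], header=False)
# ... construct property_id and group as in the local PySpark script ...
df.write.mode("overwrite").json(args["output_path"])

job.commit()
```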

Below is a part of the job logs:

......
LogType:stdout
Log Upload Time:Wed Aug 18 15:04:40 +0000 2021
LogLength:3155
Log Contents:
2021-08-18 15:04:24,118 - root - INFO - Starting the data pipeline
2021-08-18 15:04:24,118 - root - INFO - Reading csv files from s3 path s3://ppd-paid-data-pipeline-input
2021-08-18 15:04:25,763 - root - INFO - Constructing a new identifier 'property_id'
2021-08-18 15:04:25,948 - root - INFO - Grouping transaction records based on 'property_id'
2021-08-18 15:04:26,053 - root - INFO - Writing json files to s3 path s3://ppd-paid-data-pipeline-output/output
End of LogType:stdout
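The log lines above follow Python's standard logging format; a configuration like the following would produce them (a sketch, not necessarily the exact setup used in the script):

```python
import logging

# "timestamp - name - level - message", matching the log excerpt above.
logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
logging.info("Starting the data pipeline")
```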

Below is a screenshot of the Glue job:

[screenshot: Glue job]

Two screenshots of the output bucket (note that the number of output partitions can be configured in the code):

[screenshots: output bucket]
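Continuing from the grouped DataFrame in the earlier sketch, the partition count could be configured at write time like this; the partition value is illustrative, and the output path is the bucket shown in the job log:

```python
# Repartition before writing to control the number of output part files.
num_output_partitions = 4  # illustrative value
grouped.repartition(num_output_partitions) \
    .write.mode("overwrite") \
    .json("s3://ppd-paid-data-pipeline-output/output")
```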

An example output file downloaded from the output bucket can be found under the folder 'example_output_from_aws'.

Future Work

Due to time constraints, below are some items of future work to explore:

  • The deployment of the data pipeline has not been automated: the artifact used by the Glue job is manually uploaded to an S3 bucket and pointed to by the Glue job. This step can be automated, and the artifact should be versioned (see the sketch after this list).
  • All the AWS resources used here can be created and managed via Terraform (for testing purposes, these resources were created and edited manually via the AWS console).
  • Besides Spark, Apache Beam is also a good tool to explore.
  • Adding tests for the scripts.
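As a starting point for the first item, a small script like the following could upload a versioned artifact to S3 from CI. The bucket name, key layout, and version source are hypothetical:

```python
import boto3

ARTIFACT = "glue_driver_spark.py"
VERSION = "0.1.0"  # hypothetical; could come from a git tag in CI

# Upload the driver script under a versioned key so the Glue job can be
# pointed at a specific, reproducible artifact (bucket name is hypothetical).
s3 = boto3.client("s3")
s3.upload_file(
    Filename=ARTIFACT,
    Bucket="ppd-paid-data-pipeline-artifacts",
    Key=f"glue-scripts/{VERSION}/{ARTIFACT}",
)
```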
