# PySpark Advanced DataFrame Concepts

This project provides a Docker-based setup to explore advanced PySpark DataFrame concepts using Jupyter notebooks. The environment includes all necessary dependencies, making it easy to get started with PySpark for data processing and analysis.

## Project Overview

- Jupyter Notebooks: Interactive notebooks for experimenting with PySpark code.
- PySpark: The Python API for Apache Spark, used for large-scale data processing.
- Docker: Containerization tool that ensures a consistent development environment.

## Setup Instructions

Follow these steps to set up and run the project on your local machine.

### 1. Prerequisites

Ensure you have the following installed on your machine:

- Docker, including Docker Compose (for building and running the container)
- Git (for cloning the repository)

### 2. Clone the Repository

Clone this GitHub repository to your local machine:

git clone https://github.com/your_username/PySpark_Advanced_DataFrame_Concepts.git
cd PySpark_Advanced_DataFrame_Concepts


### 3. Download the NYC Taxi Trip Data

This project uses the NYC Taxi Trip Data in Parquet format as an example dataset. You can download the dataset from the [NYC Taxi & Limousine Commission website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

#### Download the Parquet file(s):
- Choose the trip record files you want to analyze (the site publishes them by month) and download them in Parquet format.
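
After downloading, it can be worth confirming that each file really is Parquet and not, for example, a saved HTML error page. The sketch below relies on the `PAR1` magic bytes that the Parquet format places at the start and end of a file. The function name `looks_like_parquet` and the example file name are hypothetical and not part of this project; this is a quick sanity check under those assumptions, not full validation.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # unencrypted Parquet files start and end with these four bytes


def looks_like_parquet(path):
    """Return True if the file has the Parquet magic bytes at both ends."""
    path = Path(path)
    # Smallest possible layout: 4-byte magic + 4-byte footer length + 4-byte magic.
    if not path.is_file() or path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last four bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    # Hypothetical file name; substitute whichever file you downloaded.
    candidate = Path("notebooks/datasets/example.parquet")
    print(candidate, "->", looks_like_parquet(candidate))
```

A `False` result usually means the download was incomplete or the saved file is not Parquet at all.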

#### Place the files in the datasets folder:
- After downloading, place the Parquet files in the following directory:

```bash
notebooks/datasets
```

### 4. Build and Run the Docker Container

This project uses the official jupyter/pyspark-notebook Docker image. The docker-compose.yml file is included for easy setup.

#### Start Docker Compose

Build and run the container using Docker Compose:
```
docker-compose up
```
#### Access JupyterLab:
- Once the container is running, open your web browser and navigate to http://localhost:8888. Log in with the token that the container prints in the terminal.
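
If the page does not load, a quick check of whether anything is listening on port 8888 can help separate "container still starting" from "wrong URL". The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical, and the default host and port are taken from the URL above. An open port only shows that something is listening there, not that it is Jupyter.

```python
import socket


def is_port_open(host="localhost", port=8888, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on http://localhost:8888")
    else:
        print("Nothing is listening on port 8888 yet; the container may still be starting.")
```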

#### Stop Docker Compose

To stop the running Docker containers, use the following command:

docker-compose down

### 5. Start Exploring

Once the container is running, open JupyterLab at http://localhost:8888 in your web browser. The notebooks are in the /home/jovyan/work/notebooks directory.
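
As a starting point, a first notebook cell might look like the sketch below. It assumes the datasets folder is visible inside the container at `/home/jovyan/work/notebooks/datasets` (inferred from the paths in this README, not confirmed by the compose file), and the helper names `find_parquet_files` and `load_trips` as well as the app name are hypothetical. PySpark is imported lazily so the file-discovery helper also works outside the container.

```python
from pathlib import Path

# Assumed in-container location of the datasets folder (see the note above).
DATASET_DIR = Path("/home/jovyan/work/notebooks/datasets")


def find_parquet_files(dataset_dir):
    """Return the sorted paths of all *.parquet files directly inside dataset_dir."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        return []
    return sorted(str(p) for p in dataset_dir.glob("*.parquet"))


def load_trips(paths, app_name="taxi-exploration"):
    """Read the given Parquet files into a single Spark DataFrame."""
    # Imported here so the helper above does not require PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.read.parquet(*paths)


if __name__ == "__main__":
    files = find_parquet_files(DATASET_DIR)
    if files:
        trips = load_trips(files)
        trips.printSchema()  # column names and types as stored in the files
        print("rows:", trips.count())
    else:
        print(f"No Parquet files found in {DATASET_DIR}")
```

Reading several monthly files in one call assumes they share a compatible schema; if they do not, loading one file at a time is the safer first step.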

## Project Structure

- /notebooks: Contains Jupyter notebooks for various PySpark DataFrame operations.
- /notebooks/datasets: Directory for datasets used in the notebooks.
- Dockerfile: Defines the Docker image for this project (optional if using jupyter/pyspark-notebook directly).
- docker-compose.yml: Docker Compose file to manage the container setup.
