Life Expectancy Data Cleaning

This project loads and cleans Eurostat life expectancy data from a provided TSV file. The code cleans the data, converts it into a more usable format, and exports the data for any given country/region as a CSV file. The project is structured as follows:

├── life_expectancy
│   ├── data
│   ├── tests
│   ├── cleaning.py
│   └── README.md
└── pyproject.toml

All the code is in the cleaning.py file.

Introduction

Life expectancy data comes in formats specific to the issuing institution. It has to be preprocessed and cleaned before any analysis can be performed. This code provides a solution for loading, cleaning, and saving life expectancy data for a specific country/region. It reshapes the data for easier analysis, corrects text formatting errors, handles missing values, and converts data types. It then exports the data for the selected country/region as a CSV file.

Usage

To use the script, follow these steps:

1. Make sure you have the necessary requirements installed (see Requirements).
2. Download the raw life expectancy data file (eu_life_expectancy_raw.tsv) and place it in the life_expectancy/data directory.
3. Open a terminal or command prompt.
4. Navigate to the project directory.
5. Run the script using the following command, replacing REGION_NAME with the code of the desired region, passed as a string (e.g., "PT" for Portugal):

python cleaning.py --region REGION_NAME

The cleaned and processed data will be saved as <REGION_NAME>_life_expectancy.csv in the data directory, with REGION_NAME equal to the input parameter.
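
As an illustration of how the --region option and the output file name fit together, here is a minimal sketch using argparse from the standard library. The function names parse_args and output_path are hypothetical, and making --region required is an assumption; the actual cleaning.py may handle its arguments differently.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --region is the only option this README documents."""
    parser = argparse.ArgumentParser(description="Clean Eurostat life expectancy data.")
    # Assumption: the region code is required; the real script may define a default.
    parser.add_argument("--region", required=True, help='Region code, e.g. "PT"')
    return parser.parse_args(argv)


def output_path(region, data_dir="data"):
    """Build <REGION_NAME>_life_expectancy.csv inside the data directory."""
    return Path(data_dir) / f"{region}_life_expectancy.csv"


# Equivalent to running: python cleaning.py --region PT
args = parse_args(["--region", "PT"])
print(output_path(args.region))
```

Passing an explicit argument list to parse_args keeps the sketch runnable outside a terminal; called without arguments, it reads sys.argv as usual.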

Functionality

The script performs the following tasks:

Loading Data: The script loads the raw life expectancy data from a provided TSV file (eu_life_expectancy_raw.tsv).
Data Cleaning: It preprocesses the data by splitting a column containing multiple variables and cleaning column names.
Data Reshaping: The script melts the DataFrame to turn all year columns into a single year column for easier analysis.
Flag Extraction: It separates the life expectancy values from the accompanying flags indicating the data provenance. The flags are currently not kept in the exported data, but might be useful in downstream analysis.
NaN Handling and Type Conversion: The script identifies NaN-like values in specified columns. It then converts all values into the appropriate data types. It also removes any rows with missing data.
Region Selection: It filters the data to keep only the rows corresponding to the specified region.
Saving Data: The filtered data is saved as <REGION_NAME>_life_expectancy.csv in the data directory.
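
The steps listed above can be sketched with pandas roughly as follows. This is not the code from cleaning.py: the sample rows are made up, the column layout (a composite "unit,sex,age,geo\time" column followed by year columns, ":" for missing values, trailing letters as flags) is assumed from the usual Eurostat TSV export, and the function name clean and the output column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed Eurostat layout (not taken from the real file).
RAW = (
    "unit,sex,age,geo\\time\t2021 \t2020 \n"
    "YR,F,Y65,PT\t21.5 e\t21.9 \n"
    "YR,F,Y65,ES\t: \t22.4 \n"
)


def clean(raw_text, region):
    """Return tidy rows (one per year) for a single region."""
    # Read everything as text so values such as "21.5 e" are not misparsed.
    df = pd.read_csv(io.StringIO(raw_text), sep="\t", dtype=str)

    # Split the composite first column into its four variables.
    first = df.columns[0]
    keys = ["unit", "sex", "age", "region"]
    parts = df[first].str.split(",", expand=True)
    parts.columns = keys
    df = pd.concat([parts, df.drop(columns=first)], axis=1)

    # Wide to long: every year column becomes a row.
    long = df.melt(id_vars=keys, var_name="year", value_name="raw")

    # Separate the number from an optional trailing flag; ":" matches nothing.
    pattern = r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<flag>[a-z]*)\s*$"
    extracted = long["raw"].str.extract(pattern)
    long["value"] = pd.to_numeric(extracted["value"], errors="coerce")
    long["year"] = long["year"].str.strip().astype(int)

    # Drop missing values and the raw text (flags are not kept), then filter.
    long = long.dropna(subset=["value"]).drop(columns="raw")
    return long[long["region"] == region].reset_index(drop=True)


print(clean(RAW, "PT"))
```

The resulting DataFrame could then be written out with to_csv(path, index=False) to produce the per-region CSV file described in the last step.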

Requirements

Python >=3.8
pandas
numpy

Installation

Clone or download this repository.

git clone git@github.com:majkah0/nos-lp-foundations-workspace.git
