This project loads and cleans Eurostat life expectancy data from a provided TSV file. The code cleans the data, converts it into a more usable format, and exports the data for any given country/region as a CSV file. The project is structured as follows:
└── life_expectancy
└── data
├── tests
├── cleaning.py
├── README.md
└── pyproject.toml
All the code is in the cleaning.py file.
Life expectancy data comes in formats typical for the issuing institution. It has to be preprocessed and cleaned before any analysis can be performed. This code provides a solution for loading, cleaning, and saving life expectancy data for a specific country/region. It reshapes the data for easier analysis, corrects text formatting errors, handles missing values, and converts data types. It then exports the data for the selected country/region as a CSV file.
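As a rough illustration of the reshaping described above, the sketch below assumes the raw file follows the usual Eurostat layout: a first column that packs several variables together (here unit,sex,age,geo\time), followed by one column per year. The sample rows, the column names, and the function name reshape_wide_to_long are assumptions made for illustration and are not taken from cleaning.py.

```python
import io

import pandas as pd

# Tiny made-up sample in the assumed Eurostat layout (tab-separated).
RAW_SAMPLE = (
    "unit,sex,age,geo\\time\t2021 \t2020 \n"
    "YR,F,Y65,PT\t21.9 \t21.6 e\n"
    "YR,M,Y65,PT\t18.1 \t: \n"
)


def reshape_wide_to_long(raw: pd.DataFrame) -> pd.DataFrame:
    """Split the composite first column and melt the year columns into rows."""
    composite = raw.columns[0]
    # "unit,sex,age,geo\time" -> ["unit", "sex", "age", "region"]
    id_names = composite.split("\\")[0].split(",")
    id_names[-1] = "region"  # hypothetical, friendlier name for "geo"
    ids = raw[composite].str.split(",", expand=True)
    ids.columns = id_names
    years = raw.drop(columns=composite)
    years.columns = years.columns.str.strip()  # "2021 " -> "2021"
    wide = pd.concat([ids, years], axis=1)
    # One row per (unit, sex, age, region, year) combination.
    return wide.melt(id_vars=id_names, var_name="year", value_name="value")


# Read everything as text so values such as "21.6 e" are kept untouched.
raw_df = pd.read_csv(io.StringIO(RAW_SAMPLE), sep="\t", dtype=str)
long_df = reshape_wide_to_long(raw_df)
print(long_df)
```

Reading every column as text at this stage is a deliberate choice in the sketch: the values may still carry flags or missing-value markers, so numeric conversion is left for a later step.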
To use the script, follow these steps:
- Make sure you have the necessary requirements installed (see Requirements).
- Download the raw life expectancy data file (eu_life_expectancy_raw.tsv) and place it in the life_expectancy/data directory.
- Open a terminal or command prompt.
- Navigate to the project directory.
- Run the script using the following command, replacing REGION_NAME with the desired region's code as a string (e.g., "PT" for Portugal):
python cleaning.py --region REGION_NAME
The cleaned and processed data will be saved as <REGION_NAME>_life_expectancy.csv in the data directory, with REGION_NAME equal to the input parameter.
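The command-line handling could look roughly like the sketch below. It only uses the --region option and the output file name pattern described above; the function names parse_args and output_path, the choice to make --region required, and the relative data path are assumptions for illustration, not the script's actual code.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --region option used in the command above."""
    parser = argparse.ArgumentParser(description="Clean life expectancy data.")
    # Making the option required is an assumption of this sketch.
    parser.add_argument("--region", required=True, help='Region code, e.g. "PT"')
    return parser.parse_args(argv)


def output_path(region: str, data_dir: Path = Path("data")) -> Path:
    """Build <REGION_NAME>_life_expectancy.csv inside the data directory."""
    return data_dir / f"{region}_life_expectancy.csv"


# Passing a list makes the sketch runnable without a real command line;
# calling parse_args() with no argument reads sys.argv instead.
args = parse_args(["--region", "PT"])
print(output_path(args.region))
```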
The script performs the following tasks:
- Loading Data: The script loads the raw life expectancy data from the provided TSV file (eu_life_expectancy_raw.tsv).
- Data Cleaning: It preprocesses the data by splitting a column containing multiple variables and cleaning column names.
- Data Reshaping: The script melts the DataFrame to turn all year columns into a single year column for easier analysis.
- Flag Extraction: It separates the life expectancy values from the accompanying flags indicating the data provenance. The flags are currently not kept in the exported data, but might be useful in downstream analysis.
- NaN Handling and Type Conversion: The script identifies NaN-like values in specified columns. It then converts all values into the appropriate data types. It also removes any rows with missing data.
- Region Selection: It filters the data to keep only the rows corresponding to the specified region.
- Saving Data: The filtered data is saved as <REGION_NAME>_life_expectancy.csv in the data directory.
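The later steps in this list (flag extraction, NaN handling, type conversion, region selection, and saving) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in cleaning.py: it assumes the data is already in long format, that ":" marks a missing value, and that flags are letters trailing the number. The sample rows and the names split_value_and_flag and clean_values are hypothetical.

```python
import pandas as pd

# Made-up long-format rows. ":" is assumed to be the missing-value marker
# and trailing letters (e.g. "e", "b") are assumed to be flags.
long_df = pd.DataFrame(
    {
        "region": ["PT", "PT", "ES", "PT"],
        "year": ["2021", "2020", "2021", "2019"],
        "value": ["21.9 ", "21.6 e", "23.1 b", ": "],
    }
)

# Optional numeric part followed by optional flag letters.
VALUE_FLAG_PATTERN = r"^(?P<value>[\d.]+)?\s*(?P<flag>[A-Za-z]*)$"


def split_value_and_flag(series: pd.Series):
    """Return (numeric values, flags); unparseable entries become NaN."""
    parts = series.str.strip().str.extract(VALUE_FLAG_PATTERN)
    return pd.to_numeric(parts["value"], errors="coerce"), parts["flag"]


def clean_values(df: pd.DataFrame, region: str) -> pd.DataFrame:
    """Convert types, drop rows with missing values, and keep one region."""
    out = df.copy()
    # The flags are extracted but, as noted above, not kept in the output.
    out["value"], _flags = split_value_and_flag(out["value"])
    out["year"] = out["year"].astype(int)
    out = out.dropna(subset=["value"])
    return out[out["region"] == region].reset_index(drop=True)


cleaned = clean_values(long_df, "PT")
# With no path, to_csv returns the CSV text; passing a path writes a file.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

Returning the flags from split_value_and_flag, even though clean_values discards them, keeps them available for downstream analysis if they are needed later.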
- Python >=3.8
- pandas
- numpy
- Clone or download this repository.
git clone [email protected]/majkah0/nos-lp-foundations-workspace.git