ds-geo-social-datascope

A project to practise visualisation and analysis skills by analysing geospatial datasets relating to human life and society. The tech stack includes R, tidyverse, ggplot2, plotly, xml2, rvest; SQLite; JavaScript, D3.js; HTML, and CSS.

The ultimate goal is to analyse multiple datasets from various sources and build a single-page website that displays some beautiful graphics using D3.js. I did this in a few parts:

  1. Cleaning, organising and merging the datasets.
  2. Exploratory analysis and visualisations with ggplot2.
  3. Final plots and website; JavaScript, D3.js.

Project structure

You can find all preliminary analysis and work done with R in the R-analysis directory. In the website directory you'll find all work done with JavaScript and D3.js.

Website structure

The tech stack will include the following components, each with its own role:

  1. ExpressJS framework - for server creation.
  2. SQL database - stores and manages data.
  3. Nginx as a reverse proxy - avoids exposing the server directly and relays client requests.
  4. NodeJS server - runs the SQL queries and the ExpressJS server.
  5. JavaScript - used on both the front end and the back end.
  6. SCSS - used for styling; preferred over plain CSS for its modularity.
  7. HTML - front end code.
  8. D3.js - visualisation library.

Preprocessing datasets

Preprocessing these datasets was a rather tedious matter. Unfortunately, they do not always come in a neat, ready-to-use format. Preprocessing includes merging, removing junk data, reformatting, and manipulating the data to fit the tidy structure. Note that the datasets are often split across multiple .csv files, so reading them in all at once and working from a list is essential:

fontes_dir <- "./data/united-states-of-america/per-county-votes-20-fontes/"
fontes_files <- list.files(fontes_dir, pattern = "\\.csv$", full.names = TRUE)  # only the .csv files
fontes <- lapply(fontes_files, read.csv)
names(fontes) <- tools::file_path_sans_ext(basename(fontes_files))  # file name without ".csv"

The trick is to list the files, pass them to the read.csv function through lapply, and then name the elements of the list after the files.

Some of the aforementioned preprocessing techniques are demonstrated in the following example. For more information consult the preprocessing scripts.

Gapminder

The data was taken from the Gapminder website and GitHub repositories: open-numbers/ddf--gapminder--fasttrack.

Unfortunately the datasets provided are quite messy, and it is difficult to obtain the full dataset. Most of the data is available in the linked repository, but some can only be obtained by manual download. Moreover, the data comes in wide format, with one row per country and one column per year.

Reshaping the data into long format is done by the clean-data.Rmd script. There I read all data files into a named list, then apply a custom function that reshapes each dataset and uses its file name as the name of the value column.

First the data is read and put into long format. The results are held in a list object:

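library(tidyverse)  # provides tibble, dplyr and the %>% pipe

reshape_manual_data <- function(x) {
	# Turn one wide table (one row per country, one column per year) into long format.
	shift_long <- function(df) {
		column_to_rownames(df, "country") %>%
			t() %>%
			as.data.frame() %>%
			rownames_to_column("year") %>%
			mutate(year = gsub("^X", "", year)) %>%  # read.csv prefixes numeric column names with "X"
			reshape2::melt(id.vars = "year")
	}

	shifted <- lapply(x, shift_long)

	# Name each value column (the third column) after the file the dataset came from.
	for (i in seq_along(x)) {
		colnames(shifted[[i]])[3] <- names(x)[i]
	}
	shifted
}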
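To make the wide-to-long step concrete, here is a toy illustration in base R. It is only a sketch of the idea, not the code from clean-data.Rmd: the data frame, its values, and the column name `value` are made up for the example.

```r
# Toy wide table: one row per country, one column per year.
# The "X" prefix mimics what read.csv does to numeric column names.
wide <- data.frame(
  country = c("Chile", "Kenya"),
  X2000   = c(1, 2),
  X2001   = c(3, 4)
)

year_cols <- names(wide)[-1]

# Long table: one row per country-year pair.
long <- data.frame(
  country = rep(wide$country, times = length(year_cols)),
  year    = rep(sub("^X", "", year_cols), each = nrow(wide)),
  value   = unlist(wide[year_cols], use.names = FALSE)
)

print(long)
```

The result has four rows (two countries times two years), which is the shape that can later be merged with other indicators on the shared key columns.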

The list of reshaped datasets is then merged into a single dataset; NA values are inserted where necessary so that all rows are kept.

data$manual <- Reduce(function(...) {
	merge(..., all = TRUE)
}, reshape_manual_data(data$manual))
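As a small illustration of how `Reduce` with `merge(..., all = TRUE)` behaves, the sketch below merges three toy long-format tables. The table names, column names, and values are invented for the example; only the `Reduce`/`merge` pattern comes from the code above.

```r
# Three toy long-format tables that share the key columns "year" and "country"
# but do not cover exactly the same years.
population <- data.frame(year = c("2000", "2001"), country = "Chile",
                         population = c(10, 11))
gdp        <- data.frame(year = c("2001", "2002"), country = "Chile",
                         gdp = c(5, 6))
life_exp   <- data.frame(year = c("2000", "2002"), country = "Chile",
                         life_exp = c(70, 72))

# merge() joins on the columns the two tables have in common;
# all = TRUE makes it a full outer join, so unmatched rows are kept with NA.
merged <- Reduce(function(...) merge(..., all = TRUE),
                 list(population, gdp, life_exp))

print(merged)
```

Each of the three years appears once in `merged`, and every indicator that was missing for a given year shows up as NA rather than causing the row to be dropped.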

Finally all the data is saved to the SQLite database found in the sql directory. I then use this database with JavaScript and D3.js to produce the website.
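Writing a data frame to SQLite from R could look roughly like the sketch below, assuming the DBI and RSQLite packages are installed. The table name `gapminder`, the toy data frame, and the temporary file path are placeholders; in the project the path would point at the database file in the sql directory.

```r
library(DBI)

# Toy stand-in for the merged dataset (values are made up).
merged <- data.frame(
  year       = c("2000", "2001"),
  country    = c("Chile", "Chile"),
  population = c(10, 11)
)

# Assumption: a temporary file is used so the example runs anywhere.
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Create (or replace) the table and fill it with the data frame.
dbWriteTable(con, "gapminder", merged, overwrite = TRUE)

# Read it back to check that the write worked.
tables <- dbListTables(con)
stored <- dbReadTable(con, "gapminder")

dbDisconnect(con)
```

Keeping one table per dataset means the website's back end can query only the columns a given plot needs.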

ggplot2 visualisations

Data sources

gapminder

US census data

Vote datasets