---
title: "Spark pipelines"
output: html_notebook
---
## Class catchup
```{r}
library(tidyverse)
library(sparklyr)
library(lubridate)

# Read a few rows locally just to capture the column names
top_rows <- read.csv("/usr/share/flights/data/flight_2008_1.csv", nrows = 5)

# Import every column as character; types are converted later in Spark
file_columns <- top_rows %>%
  rename_all(tolower) %>%
  map(function(x) "character")

conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9

sc <- spark_connect(master = "local", config = conf, version = "2.0.0")

spark_flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "/usr/share/flights/data/",
  memory = FALSE,
  columns = file_columns,
  infer_schema = FALSE
)
```
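As a quick check that the catchup worked, the registered table should now be visible through `dplyr` (a minimal sketch; `src_tbls()` is the standard `dplyr` helper for listing tables on a source):
```{r}
# The "flights" table should appear among the tables registered in Spark
src_tbls(sc)
```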
## 8.1 - Recreate the transformations
1. Use `sdf_partition()` to create training and testing samples of the base *flights* table
```{r}
model_data <- sdf_partition(
  tbl(sc, "flights"),
  training = 0.01,
  testing = 0.01,
  rest = 0.98
)
```
2. Recreate the `dplyr` transformations stored in the `cached_flights` variable from the previous unit
```{r}
pipeline_df <- model_data$training %>%
  mutate(
    arrdelay = ifelse(arrdelay == "NA", 0, arrdelay),
    depdelay = ifelse(depdelay == "NA", 0, depdelay)
  ) %>%
  select(
    month,
    dayofmonth,
    arrtime,
    arrdelay,
    depdelay,
    crsarrtime,
    crsdeptime,
    distance
  ) %>%
  mutate_all(as.numeric)
```
3. Create a new Spark pipeline
```{r}
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(
    tbl = pipeline_df
  ) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  ft_bucketizer(
    input_col = "crsdeptime",
    output_col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  # Model on depdelay, not arrdelay: `delayed` is derived from arrdelay,
  # so using arrdelay as a predictor would leak the label into the model
  ft_r_formula(delayed ~ depdelay + dephour) %>%
  ml_logistic_regression()

flights_pipeline
```
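Individual stages can be pulled out of the pipeline to confirm what each step will do. A quick sketch using `ml_stage()`, addressing the stage by its position in the pipeline:
```{r}
# Inspect the second stage of the (unfitted) pipeline: the binarizer
ml_stage(flights_pipeline, 2)
```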
## 8.2 - Fit, evaluate, save
1. Fit (train) the pipeline's model
```{r}
model <- ml_fit(flights_pipeline, model_data$training)
model
```
2. Use the newly fitted model to perform predictions using `ml_transform()`
```{r}
predictions <- ml_transform(
  x = model,
  dataset = model_data$testing
)
```
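The transformed table carries the original features plus the columns added by each pipeline stage. A quick peek (a sketch; the column names follow the pipeline definition above):
```{r}
# Preview the label, the bucketed departure hour, and the predicted class
predictions %>%
  select(delayed, dephour, prediction) %>%
  head()
```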
3. Use `group_by()` to see how the model performed
```{r}
predictions %>%
  group_by(delayed, prediction) %>%
  tally()
```
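The four counts above form a confusion matrix. To collapse them into a single accuracy figure, here is a minimal sketch using plain `dplyr`, computed inside Spark:
```{r}
# Share of test rows where the predicted class matches the label
predictions %>%
  mutate(correct = as.numeric(delayed == prediction)) %>%
  summarise(accuracy = mean(correct, na.rm = TRUE))
```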
4. Save the model to disk using `ml_save()`
```{r}
ml_save(model, "saved_model", overwrite = TRUE)
list.files("saved_model")
```
5. Save the pipeline using `ml_save()`
```{r}
ml_save(flights_pipeline, "saved_pipeline", overwrite = TRUE)
list.files("saved_pipeline")
```
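Both artifacts are saved in Spark's native MLlib persistence format, so each is a directory rather than a single file. To see the full contents (a quick sketch):
```{r}
# The saved pipeline is a small directory tree of metadata and stages
list.files("saved_pipeline", recursive = TRUE)
```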
6. Close the Spark session
```{r}
spark_disconnect(sc)
```
## 8.3 - Reload model
*Use the saved model inside a different Spark session*
1. Open a new Spark connection and reload the data
```{r}
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

# file_columns is still available in the R session from the catchup chunk
spark_flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "/usr/share/flights/flights_2008.csv",
  memory = FALSE,
  columns = file_columns,
  infer_schema = FALSE
)
```
2. Use `ml_load()` to reload the model directly into the Spark session
```{r}
reload <- ml_load(sc, "saved_model")
reload
```
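The reloaded object is a fitted `PipelineModel`, so its stages can be listed to verify that everything came back intact (a sketch using `ml_stages()`):
```{r}
# List the fitted stages of the reloaded pipeline model
ml_stages(reload)
```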
3. Create a new table called *current* that pulls today's flights
```{r}
library(lubridate)

# `!!` forces month(now()) and day(now()) to be evaluated in R, so only
# the resulting numbers appear in the SQL that is sent to Spark
current <- tbl(sc, "flights") %>%
  filter(
    month == !! month(now()),
    dayofmonth == !! day(now())
  )

show_query(current)
```
4. Preview the new *current* table
```{r}
head(current)
```
5. Run predictions against the new data set
```{r}
new_predictions <- ml_transform(
  x = reload,
  dataset = current
)
```
6. Get a quick count of expected delayed flights
```{r}
new_predictions %>%
  summarise(late_flights = sum(prediction, na.rm = TRUE))
```
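Because the pipeline also creates the `dephour` bucket, the same output can be broken down by scheduled-departure window (a quick sketch):
```{r}
# Expected delayed flights per departure-time bucket
new_predictions %>%
  group_by(dephour) %>%
  summarise(late_flights = sum(prediction, na.rm = TRUE))
```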
## 8.4 - Reload pipeline
1. Use `ml_load()` to reload the pipeline into the Spark session
```{r}
pipeline <- ml_load(sc, "saved_pipeline")
pipeline
```
2. Create a new sample data set using `sample_frac()`
```{r}
sample <- tbl(sc, "flights") %>%
  sample_frac(0.001)
```
3. Re-fit the model using `ml_fit()` and the new sample data
```{r}
new_model <- ml_fit(pipeline, sample)
new_model
```
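To compare the refit against the earlier model, the fitted logistic regression can be extracted from the new `PipelineModel` (a sketch, addressing the stage by its position, the fifth and last stage in this pipeline):
```{r}
# Pull out the fitted logistic regression stage of the refit model
ml_stage(new_model, 5)
```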
4. Save the newly fitted model
```{r}
ml_save(new_model, "new_model", overwrite = TRUE)
list.files("new_model")
```
5. Disconnect from Spark
```{r}
spark_disconnect(sc)
```