README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 300,
  message = F,
  warning = F
)

devtools::load_all()
library(tidyverse)
```


# anomalize <img src="man/figures/anomalize-logo.png" width="147" height="170" align="right" />

[![Travis build status](https://travis-ci.org/business-science/anomalize.svg?branch=master)](https://travis-ci.org/business-science/anomalize)
[![Coverage status](https://codecov.io/gh/business-science/anomalize/branch/master/graph/badge.svg)](https://codecov.io/github/business-science/anomalize?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/anomalize)](https://cran.r-project.org/package=anomalize)
![](http://cranlogs.r-pkg.org/badges/anomalize?color=brightgreen)
![](http://cranlogs.r-pkg.org/badges/grand-total/anomalize?color=brightgreen)


> Tidy anomaly detection

`anomalize` enables a tidy workflow for detecting anomalies in data. The main functions are `time_decompose()`, `anomalize()`, and `time_recompose()`. When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data.

## Anomalize In 2 Minutes (YouTube)

<a href="https://www.youtube.com/watch?v=Gk_HwjhlQJs" target="_blank"><img src="http://img.youtube.com/vi/Gk_HwjhlQJs/0.jpg" 
alt="Anomalize" width="100%" height="350"/></a>

Check out our entire [Software Intro Series](https://www.youtube.com/watch?v=Gk_HwjhlQJs&list=PLo32uKohmrXsYNhpdwr15W143rX6uMAze) on YouTube!

## Installation

You can install the development version with `devtools` or the most recent CRAN version with `install.packages()`:

``` r
# devtools::install_github("business-science/anomalize")
install.packages("anomalize")
```

## How It Works

`anomalize` has three main functions:

- `time_decompose()`: Separates the time series into seasonal, trend, and remainder components
- `anomalize()`: Applies anomaly detection methods to the remainder component.
- `time_recompose()`: Calculates limits that separate the "normal" data from the anomalies!

## Getting Started

Load the `tidyverse` and `anomalize` packages.

```{r, eval = F}
library(tidyverse)
library(anomalize)
```


Next, let's get some data.  `anomalize` ships with a data set called `tidyverse_cran_downloads` that contains the daily CRAN download counts for 15 "tidy" packages from 2017-01-01 to 2018-03-01.

```{r tidyverse_plot_1, fig.height=5}
tidyverse_cran_downloads %>%
    ggplot(aes(date, count)) +
    geom_point(color = "#2c3e50", alpha = 0.25) +
    facet_wrap(~ package, scale = "free_y", ncol = 3) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
    labs(title = "Tidyverse Package Daily Download Counts",
         subtitle = "Data from CRAN by way of cranlogs package")
```

Suppose we want to determine which daily download "counts" are anomalous. It's as easy as using the three main functions (`time_decompose()`, `anomalize()`, and `time_recompose()`) along with a visualization function, `plot_anomalies()`.

```{r tidyverse_anoms_1, fig.height=8}
tidyverse_cran_downloads %>%
    # Data Manipulation / Anomaly Detection
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
    labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods") 
```

If you're familiar with Twitter's `AnomalyDetection` package, you can implement that method by combining `time_decompose(method = "twitter")` with `anomalize(method = "gesd")`. Additionally, we'll adjust the `trend = "2 months"` to adjust the median spans, which is how Twitter's decomposition method works.

```{r}
# Get only lubridate downloads
lubridate_dloads <- tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>% 
    ungroup()

# Anomalize!!
lubridate_dloads %>%
    # Twitter + GESD
    time_decompose(count, method = "twitter", trend = "2 months") %>%
    anomalize(remainder, method = "gesd") %>%
    time_recompose() %>%
    # Anomaly Visualziation
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "Twitter + GESD Methods")
    
```

Last, we can compare to STL + IQR methods, which use different decomposition and anomaly detection approaches.

```{r}
lubridate_dloads %>%
    # STL + IQR Anomaly Detection
    time_decompose(count, method = "stl", trend = "2 months") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "STL + IQR Methods")
```

## But Wait, There's More!

There are a several extra capabilities:

- `time_frequency()` and `time_trend()` for generating frequency and trend spans using date and datetime information, which is more intuitive than selecting numeric values. Also, `period = "auto"` automatically selects frequency and trend spans based on the time scale of the data.

```{r, message = T}
# Time Frequency
time_frequency(lubridate_dloads, period = "auto")
```

```{r, message = T}
# Time Trend
time_trend(lubridate_dloads, period = "auto")
```


- `plot_anomaly_decomposition()` for visualizing the inner workings of how algorithm detects anomalies in the "remainder". 

```{r, fig.height=7}
tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>%
    ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    plot_anomaly_decomposition() +
    labs(title = "Decomposition of Anomalized Lubridate Downloads")
```

<!-- - Other `time_decompose` methods: In addition to "stl" and "twitter", we have added "multiplicative" for time series with non-constant variance. Word of caution that statistical transformations such as logarithmic or power transformations may perform better. -->

- Vector functions for anomaly detection: `iqr()` and `gesd()`. These are great for just using on numeric data. Note that trend and seasonality should already be removed for time series data.

```{r}
# Data with outliers
set.seed(100)
x <- rnorm(100)
idx_outliers <- sample(100, size = 5)
x[idx_outliers] <- x[idx_outliers] + 10

# IQR method
iqr(x, alpha = 0.05, max_anoms = 0.2)
```

- Anomaly Reports: Using `verbose = TRUE`, we can return a nice report of useful information related to the outliers.

```{r}
lubridate_dloads %>%
    time_decompose(count) %>%
    anomalize(remainder, verbose = TRUE)
```


## References

Several other packages were instrumental in developing anomaly detection methods used in `anomalize`:

- Twitter's `AnomalyDetection`, which implements decomposition using median spans and the Generalized Extreme Studentized Deviation (GESD) test for anomalies.
- `forecast::tsoutliers()` function, which implements the IQR method.