Website Time Series Anomaly Detection
Project Description
Anomaly detection is valuable in many ways. For example, it can flag fraud, support system health monitoring, or help data engineers identify spikes in website traffic. It can also be used to remove extreme outliers from datasets before modeling.

In this project, webpage traffic outliers were identified using data from the calendar year 2016. The project focused on popular United States sports, entertainment, and political events and people.
This project applied the anomalize package to assist in identifying outliers.
Interactive Tableau Public Viz
An interactive Tableau viz was created using the final output produced in this project. The interactive viz can be viewed by clicking the link below.

Load Libraries

# 1.0 LIBRARIES ----
library(vroom)
library(tidyverse)
library(tidyquant)
library(lubridate)
library(rsample)
library(anomalize)
library(fuzzyjoin)
library(readxl)
Load Data
The original raw web data file included in the project can be downloaded here. The raw data was modified in a previous offline project because the original file was large (271 MB). I also created a couple of lookup tables that provide valuable features later in the pipeline.
# 2.0 LOAD DATA ----
websites_sample_tbl  <- vroom::vroom("Data_Sources/2020_06_15_Anomalize/websites_sample_tbl.csv", delim = ",")
page_summary_lkp_tbl <- read_xlsx("Data_Sources/2020_06_15_Anomalize/Topics_Dates_LKP.xlsx", sheet = "Sheet1")
topics_dates_lkp_tbl <- read_xlsx("Data_Sources/2020_06_15_Anomalize/Topics_Dates_LKP.xlsx", sheet = "Sheet2")
Subsets of the three tibbles are shown below. The tibbles include the website traffic visits, the topic summaries, details about the outlier events, and the sources and links to those details.
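As a quick way to inspect them, each tibble can be previewed with glimpse() from the tidyverse. A minimal sketch is shown here; the comments describing each table's contents are assumptions based on the descriptions above.

# Preview the structure of the three tibbles
websites_sample_tbl  %>% glimpse()   # website traffic visits by Page_Summary and date
page_summary_lkp_tbl %>% glimpse()   # lookup: topic categories and proper titles
topics_dates_lkp_tbl %>% glimpse()   # lookup: outlier event details, sources, and links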
A few handy functions from the anomalize package were used in the analysis. The official descriptions from the package site are listed below.
- time_decompose(): Separates the time series into seasonal, trend, and remainder components.
- anomalize(): Applies anomaly detection methods to the remainder component.
- recomposed_l1 & recomposed_l2: Added to calculate the limits that separate the “normal” data from the anomalies.
## ANOMALY DETECTION -----
websites_anomalies_tbl <- websites_sample_tbl %>%
    group_by(Page_Summary) %>%
    filter(!is.na(date)) %>%
    time_decompose(visits, method = "stl") %>%
    anomalize(remainder, method = "iqr", alpha = 0.014) %>%
    mutate(recomposed_l1 = season + trend + remainder_l1) %>%
    mutate(recomposed_l2 = season + trend + remainder_l2)
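As a side note, the anomalize package also provides a built-in time_recompose() step that produces the same recomposed_l1 and recomposed_l2 limit columns. A minimal sketch of that alternative, assuming the same input tibble (the output name websites_anomalies_alt_tbl is hypothetical):

# Alternative: compute the limit columns with the package's time_recompose() step
websites_anomalies_alt_tbl <- websites_sample_tbl %>%
    group_by(Page_Summary) %>%
    filter(!is.na(date)) %>%
    time_decompose(visits, method = "stl") %>%
    anomalize(remainder, method = "iqr", alpha = 0.014) %>%
    time_recompose()   # adds recomposed_l1 and recomposed_l2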
Finally, an example of the anomalies viewed in R is shown below. The anomalize package includes a plot_anomalies() function that quickly visualizes the outliers.
In 2016, the Chicago Cubs reached the playoffs and eventually won the World Series. The spike in popularity around that event is clearly apparent in the visual below.
websites_anomalies_tbl %>%
    filter(Page_Summary == "Chicago_Cubs") %>%
    filter(date >= as.Date("2016-01-01")) %>%
    plot_anomalies(ncol = 2, time_recomposed = TRUE)
To view the fully interactive Tableau Public viz with all 2016 popular topics listed, please click here.
Final Tidy Tibble With Proper Titles & Descriptions
Finally, we can add the web page Category, Proper Title, Description of the Outlier Event, and the Source. This tibble is what is exported and loaded into the Tableau Public viz.
websites_tidy_tbl <- websites_anomalies_tbl %>%
    left_join(page_summary_lkp_tbl) %>%
    left_join(topics_dates_lkp_tbl)
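A minimal sketch of the export step follows, assuming a CSV format and a hypothetical output path alongside the other project files; the actual export location used for the Tableau Public viz is not specified here.

# Write the final tidy tibble for the Tableau Public viz
# (hypothetical output path; adjust to the project's actual location)
websites_tidy_tbl %>%
    ungroup() %>%
    write_csv("Data_Sources/2020_06_15_Anomalize/websites_tidy_tbl.csv")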





