Introduction

Aggregating data in R and Python is fairly similar. The examples were inspired by Matt Dancho and his incredible library of training videos on Business Science University.

In the R examples, we will be using the RStudio IDE. It is assumed that you have a project set up in the folder where the data files exist. For more information about getting a project set up in R Studio and the Python environment set up in VS Code, click the links below.

Setting Up Projects in R

Setting Up Python VS Code Environment

R: Aggregating Data

In the previous articles, we detailed how to perform the following functions listed below. For more information, click on the links below.

01: Data Imports

02: Join Data
03: Change Data Types
04: String Splits
05: Calculate New Columns
06: Organize Columns
07: Rename Columns
08: Saving Data

library(tidyverse)
library(readxl)

bikes      <- read_excel("00_Data_Files/bikes.xlsx")
bikeshops  <- read_excel("00_Data_Files/bikeshops.xlsx")
orderlines <- read_excel("00_Data_Files/orderlines.xlsx")

bikes
bikeshops
orderlines

bikes_data_joined <- orderlines %>% 
  left_join(bikes, by=c("product.id" = "bike.id")) %>%
  left_join(bikeshops, by=c("customer.id" = "bikeshop.id"))

bikes_data_joined %>% glimpse()

df <- bikes_data_joined %>%

  ## Split Description into 3 columns
  separate(description, c("category.1", "category.2", "frame.material"), " - ") %>%
  ## Split location into 2 columns
  separate(location, c("city", "state"), ", ", remove = FALSE) %>%
  glimpse()

df <- df %>% 
  mutate(total.price = quantity * price) %>% 
  glimpse()

df <- df %>%
  select(order.id,
         order.line,
         order.date,
         model,
         quantity,
         price,
         total.price,
         bikeshop.name,
         location,
         category.1,
         category.2,
         frame.material,
         city,
         state) %>%
  glimpse()

names(df)

names(df) <- str_replace_all(names(df), "\\.", "_")

df %>% glimpse()

df_2 <- df

Now that the data is saved to a new object, we can utilize various columns to do some analysis. In our example below, we aggregate total_price by state and sort from highest to lowest.

Python: Aggregating Data

In the previous articles, we detailed how to perform the following functions listed below. For more information, click on the links below.

01: Data Imports

02: Join Data
03: Change Data Types
04: String Splits
05: Calculate New Columns
06: Organize Columns
07: Rename Columns
08: Saving Data

# 1.0 Load Libraries ----

# # Load Libraries

# Core Python Data Analysis
from numpy.core.defchararray import index
import pandas as pd
import numpy as np

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

bike_orderlines_joined_df = orderlines_df \
    .drop(columns='Unnamed: 0', axis=1) \
    .merge(
        right = bikes_df,
        how='left',
        left_on='product.id',
        right_on='bike.id'
    ) \
    .merge(
        right=bikeshops_df,
        how = 'left',
        left_on='customer.id',
        right_on='bikeshop.id'
    )

bike_orderlines_joined_df

bike_orderlines_joined_df.info()

df = bike_orderlines_joined_df
df.T

temp_df = df['description'].str.split(pat=' - ', expand = True)

df['category.1'] = temp_df[0]
df['category.2'] = temp_df[1]
df['frame.material'] = temp_df[2]

df.T

temp_df = df['location'].str.split(', ', n = 1, expand = True)

df['city'] = temp_df[0]
df['state'] = temp_df[1]

df.T

df['total.price'] = df['quantity'] * df['price']

df.T

df.columns

cols_to_keep_list = [
    'order.id', 
    'order.line', 
    'order.date', 
    'model',  
    'quantity',
    'price', 
    'total.price',
    'bikeshop.name', 
    'location', 
    'category.1', 
    'category.2',
    'frame.material', 
    'city', 
    'state'
]

cols_to_keep_list

df = df[cols_to_keep_list]

df.T

df.columns = df.columns.str.replace(".", "_")

df.T

df_2 = df.copy()

Similar to the R example above, after the data is saved to a new object, we can utilize various columns to do some analysis. In our example below, we aggregate total_price by state and sort from highest to lowest.

sales_by_state_df = df.groupby('state') \
    .agg(total_price_sum = ('total_price', sum)) \
    .reset_index()

sales_by_state_df.sort_values('total_price_sum', ascending=False)

Conclusion

The process of aggregating data in R and Python was very similar. This concludes this series of data wrangling and manipulation.

For the complete list of R v Python topics, click on the links below.

01: Data Imports

02: Join Data
03: Change Data Types
04: String Splits
05: Calculate New Columns
06: Organize Columns
07: Rename Columns
08: Saving Data
09: Aggregations

Search This Blog

Data Exploration

Python v R Series: Part 09: Aggregations

Introduction

R: Aggregating Data

Python: Aggregating Data

Conclusion

Popular posts from this blog

Website Time Series Anomaly Detection

MySQL Part 1: Getting MySQL Set Up in goormIDE

Do Popular Market Index Returns Follow a Normal Distribution?