Python v R Series: Part 09: Aggregations
Introduction
Aggregating data in R and Python is fairly similar. The examples were inspired by Matt Dancho and his incredible library of training videos on Business Science University.
In the R examples, we will be using the RStudio IDE. It is assumed that you have a project set up in the folder where the data files exist. For more information about getting a project set up in R Studio and the Python environment set up in VS Code, click the links below.
R: Aggregating Data
In the previous articles, we detailed how to perform the following functions listed below. For more information, click on the links below.
library(tidyverse) library(readxl) bikes <- read_excel("00_Data_Files/bikes.xlsx") bikeshops <- read_excel("00_Data_Files/bikeshops.xlsx") orderlines <- read_excel("00_Data_Files/orderlines.xlsx")
bikes bikeshops orderlines
bikes_data_joined <- orderlines %>% left_join(bikes, by=c("product.id" = "bike.id")) %>% left_join(bikeshops, by=c("customer.id" = "bikeshop.id")) bikes_data_joined %>% glimpse()
df <- bikes_data_joined %>%
## Split Description into 3 columns separate(description, c("category.1", "category.2", "frame.material"), " - ") %>% ## Split location into 2 columns separate(location, c("city", "state"), ", ", remove = FALSE) %>% glimpse()
df <- df %>% mutate(total.price = quantity * price) %>% glimpse()
df <- df %>% select(order.id, order.line, order.date, model, quantity, price, total.price, bikeshop.name, location, category.1, category.2, frame.material, city, state) %>% glimpse()
names(df)
names(df) <- str_replace_all(names(df), "\\.", "_") df %>% glimpse()
df_2 <- df
Now that the data is saved to a new object, we can utilize various columns to do some analysis. In our example below, we aggregate total_price by state and sort from highest to lowest.
Python: Aggregating Data
In the previous articles, we detailed how to perform the following functions listed below. For more information, click on the links below.
# 1.0 Load Libraries ---- # # Load Libraries # Core Python Data Analysis from numpy.core.defchararray import index import pandas as pd import numpy as np
bikes_df = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")
bikes_df = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")
bike_orderlines_joined_df = orderlines_df \ .drop(columns='Unnamed: 0', axis=1) \ .merge( right = bikes_df, how='left', left_on='product.id', right_on='bike.id' ) \ .merge( right=bikeshops_df, how = 'left', left_on='customer.id', right_on='bikeshop.id' )
bike_orderlines_joined_df
bike_orderlines_joined_df.info()
df = bike_orderlines_joined_df df.T
temp_df = df['description'].str.split(pat=' - ', expand = True) df['category.1'] = temp_df[0] df['category.2'] = temp_df[1] df['frame.material'] = temp_df[2] df.T
temp_df = df['location'].str.split(', ', n = 1, expand = True) df['city'] = temp_df[0] df['state'] = temp_df[1] df.T
df['total.price'] = df['quantity'] * df['price'] df.T
df.columns cols_to_keep_list = [ 'order.id', 'order.line', 'order.date', 'model', 'quantity', 'price', 'total.price', 'bikeshop.name', 'location', 'category.1', 'category.2', 'frame.material', 'city', 'state' ] cols_to_keep_list df = df[cols_to_keep_list] df.T
df.columns = df.columns.str.replace(".", "_") df.T
df_2 = df.copy()
Similar to the R example above, after the data is saved to a new object, we can utilize various columns to do some analysis. In our example below, we aggregate total_price by state and sort from highest to lowest.
sales_by_state_df = df.groupby('state') \ .agg(total_price_sum = ('total_price', sum)) \ .reset_index() sales_by_state_df.sort_values('total_price_sum', ascending=False)
Conclusion
The process of aggregating data in R and Python was very similar. This concludes this series of data wrangling and manipulation.
For the complete list of R v Python topics, click on the links below.