Python v R Series: Part 02: Join Data

 


Introduction

Joining and merging data are straightforward to do in both R and Python.  The examples were inspired by Matt Dancho and his incredible library of training videos on Business Science University

In the R examples, we will be using the RStudio IDE.  It is assumed that you have a project set up in the folder where the data files exist. For more information about getting a project set up in R Studio and the Python environment set up in VS Code, click the links below.

In both examples, we will be importing and joining 3 different excel data sets.  Do not worry if you do not have the exact data sets to follow along.  You can simply use a data set of your own and follow along.


R: Joining Data

In the previous article, we discussed how to correctly import and view data.  The code below includes all the code from the last article. Click the link below for more details.


library(tidyverse)
library(readxl)

bikes      <- read_excel("00_Data_Files/bikes.xlsx")
bikeshops  <- read_excel("00_Data_Files/bikeshops.xlsx")
orderlines <- read_excel("00_Data_Files/orderlines.xlsx")

bikes
bikeshops
orderlines

The results of the data imports are listed below.














Now that all the data is appropriately imported, we can join the three tables together.  In the first set, we will join the product.id column from the orderlines table to the bike.id column in the bikes table.

In the second join, we will join the customer.id column from the orderlines table to the bikeshops.id column in the bikeshops table. We will call the final table bikes_data_joined.

bikes_data_joined <- orderlines %>% 
  left_join(bikes, by=c("product.id" = "bike.id")) %>%
  left_join(bikeshops, by=c("customer.id" = "bikeshop.id"))

bikes_data_joined %>% glimpse()








Python: Joining Data

In the previous article, we discussed how to correctly import and view data.  The code below includes all the code from the last article. Click the link below for more details.

01: Data Imports

# 1.0 Load Libraries ----

# # Load Libraries

# Core Python Data Analysis
from numpy.core.defchararray import index
import pandas as pd
import numpy as np

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

The data tables listed below give the same results as the R example above.










We will perform the same exact joins as we did in the R example above.  You will notice slight differences in the code, but the end result is precisely the same.

bike_orderlines_joined_df = orderlines_df \
    .drop(columns='Unnamed: 0', axis=1) \
    .merge(
        right = bikes_df,
        how='left',
        left_on='product.id',
        right_on='bike.id'
    ) \
    .merge(
        right=bikeshops_df,
        how = 'left',
        left_on='customer.id',
        right_on='bikeshop.id'
    )

bike_orderlines_joined_df
bike_orderlines_joined_df.info()



Conclusion

The joining process between R and Python was again almost identical.  Both the platforms make it incredibly easy to join the three data sets.  

For the complete list of R v Python topics, click on the links below.



Popular posts from this blog

MySQL Part 1: Getting MySQL Set Up in goormIDE

Do Popular Market Index Returns Follow a Normal Distribution?