Python v R Series: Part 04: String Splits

 


Introduction

Splitting strings based on a delimited is easy in both R and Python.  The examples were inspired by Matt Dancho and his incredible library of training videos on Business Science University

In the R examples, we will be using the RStudio IDE.  It is assumed that you have a project set up in the folder where the data files exist. For more information about getting a project set up in R Studio and the Python environment set up in VS Code, click the links below.


R: Split Strings Based on Delimiter

In the previous articles, we detailed how to perform the following functions listed below.  For more information, click on the links below.


library(tidyverse)
library(readxl)

bikes      <- read_excel("00_Data_Files/bikes.xlsx")
bikeshops  <- read_excel("00_Data_Files/bikeshops.xlsx")
orderlines <- read_excel("00_Data_Files/orderlines.xlsx")

bikes
bikeshops
orderlines

bikes_data_joined <- orderlines %>% 
  left_join(bikes, by=c("product.id" = "bike.id")) %>%
  left_join(bikeshops, by=c("customer.id" = "bikeshop.id"))

bikes_data_joined %>% glimpse()










After our data set is joined together, we want to split the description column into three separate columns.  We also want to split up the location into both City and State.  The code below accomplishes this task.  We will also change the name of the R tibble to df.  Although not necessary, the renaming will match the best practice implemented in the Python scripts.

df <- bikes_data_joined %>%
  ## Split Description into 3 columns
  separate(description, c("category.1", "category.2", "frame.material"), " - ") %>%
  ## Split location into 2 columns
  separate(location, c("city", "state"), ", ") %>%
  glimpse()










Python: Split Strings Based on Delimiter 

In the previous articles, we detailed how to perform the following functions listed below.  For more information, click on the links below.


# 1.0 Load Libraries ----

# # Load Libraries

# Core Python Data Analysis
from numpy.core.defchararray import index
import pandas as pd
import numpy as np

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

bike_orderlines_joined_df = orderlines_df \
    .drop(columns='Unnamed: 0', axis=1) \
    .merge(
        right = bikes_df,
        how='left',
        left_on='product.id',
        right_on='bike.id'
    ) \
    .merge(
        right=bikeshops_df,
        how = 'left',
        left_on='customer.id',
        right_on='bikeshop.id'
    )

bike_orderlines_joined_df
bike_orderlines_joined_df.info()

df = bike_orderlines_joined_df
df.T


Similar to our example in R we want to split the description column into three separate columns.  We also want to split up the location into both City and State.  The code below accomplishes this task.  We also changed the name of the R tibble to df.  

Working with Python data frames is a little different than R since we set up temporary data frame assignments to pass the functions through to each of the columns.  See both examples below.

Part 1: Split the description column
temp_df = df['description'].str.split(pat=' - ', expand = True)

df['category.1'] = temp_df[0]
df['category.2'] = temp_df[1]
df['frame.material'] = temp_df[2]

df.T

Part 2: Split the location column
temp_df = df['location'].str.split(', ', n = 1, expand = True)

df['city'] = temp_df[0]
df['state'] = temp_df[1]

df.T

The final Python data frame results will match the R output listed above.


Conclusion

The process of splitting strings between R and Python was again very similar.  However, I do think that the R code was much more elegant compared to the Python code.  In the next section, we will calculate and add new columns to our tables.

For the complete list of R v Python topics, click on the links below.


Popular posts from this blog

MySQL Part 1: Getting MySQL Set Up in goormIDE

Do Popular Market Index Returns Follow a Normal Distribution?