Introduction

Splitting strings based on a delimited is easy in both R and Python. The examples were inspired by Matt Dancho and his incredible library of training videos on Business Science University.

In the R examples, we will be using the RStudio IDE. It is assumed that you have a project set up in the folder where the data files exist. For more information about getting a project set up in R Studio and the Python environment set up in VS Code, click the links below.

Setting Up Projects in R

Setting Up Python VS Code Environment

R: Split Strings Based on Delimiter

In the previous articles, we detailed how to perform the following functions listed below. For more information, click on the links below.

01: Data Imports

02: Join Data

03: Change Data Types

library(tidyverse)
library(readxl)

bikes      <- read_excel("00_Data_Files/bikes.xlsx")
bikeshops  <- read_excel("00_Data_Files/bikeshops.xlsx")
orderlines <- read_excel("00_Data_Files/orderlines.xlsx")

bikes
bikeshops
orderlines

bikes_data_joined <- orderlines %>% 
  left_join(bikes, by=c("product.id" = "bike.id")) %>%
  left_join(bikeshops, by=c("customer.id" = "bikeshop.id"))

bikes_data_joined %>% glimpse()

After our data set is joined together, we want to split the description column into three separate columns. We also want to split up the location into both City and State. The code below accomplishes this task. We will also change the name of the R tibble to df. Although not necessary, the renaming will match the best practice implemented in the Python scripts.

df <- bikes_data_joined %>%

  ## Split Description into 3 columns
  separate(description, c("category.1", "category.2", "frame.material"), " - ") %>%
  ## Split location into 2 columns
  separate(location, c("city", "state"), ", ") %>%
  glimpse()

Python: Split Strings Based on Delimiter

In the previous articles, we detailed how to perform the following functions listed below. For more information, click on the links below.

01: Data Imports

02: Join Data

03: Change Data Types

# 1.0 Load Libraries ----

# # Load Libraries

# Core Python Data Analysis
from numpy.core.defchararray import index
import pandas as pd
import numpy as np

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

bikes_df      = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df  = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")

bike_orderlines_joined_df = orderlines_df \
    .drop(columns='Unnamed: 0', axis=1) \
    .merge(
        right = bikes_df,
        how='left',
        left_on='product.id',
        right_on='bike.id'
    ) \
    .merge(
        right=bikeshops_df,
        how = 'left',
        left_on='customer.id',
        right_on='bikeshop.id'
    )

bike_orderlines_joined_df

bike_orderlines_joined_df.info()

df = bike_orderlines_joined_df
df.T

Similar to our example in R we want to split the description column into three separate columns. We also want to split up the location into both City and State. The code below accomplishes this task. We also changed the name of the R tibble to df.

Working with Python data frames is a little different than R since we set up temporary data frame assignments to pass the functions through to each of the columns. See both examples below.

Part 1: Split the description column

temp_df = df['description'].str.split(pat=' - ', expand = True)

df['category.1'] = temp_df[0]
df['category.2'] = temp_df[1]
df['frame.material'] = temp_df[2]

df.T

Part 2: Split the location column

temp_df = df['location'].str.split(', ', n = 1, expand = True)

df['city'] = temp_df[0]
df['state'] = temp_df[1]

df.T

The final Python data frame results will match the R output listed above.

Conclusion

The process of splitting strings between R and Python was again very similar. However, I do think that the R code was much more elegant compared to the Python code. In the next section, we will calculate and add new columns to our tables.

For the complete list of R v Python topics, click on the links below.

01: Data Imports

02: Join Data
03: Change Data Types
04: String Splits
05: Calculate New Columns
06: Organize Columns
07: Rename Columns
08: Saving Data
09: Aggregations

Search This Blog

Data Exploration

Python v R Series: Part 04: String Splits

Introduction

R: Split Strings Based on Delimiter

Python: Split Strings Based on Delimiter

Conclusion

Popular posts from this blog

Website Time Series Anomaly Detection

MySQL Part 1: Getting MySQL Set Up in goormIDE

Do Popular Market Index Returns Follow a Normal Distribution?