Python v R Series: Part 04: String Splits
Introduction
Splitting strings based on a delimited is easy in both R and Python. The examples were inspired by Matt Dancho and his incredible library of training videos on Business Science University.
In the R examples, we will be using the RStudio IDE. It is assumed that you have a project set up in the folder where the data files exist. For more information about getting a project set up in R Studio and the Python environment set up in VS Code, click the links below.
R: Split Strings Based on Delimiter
library(tidyverse) library(readxl) bikes <- read_excel("00_Data_Files/bikes.xlsx") bikeshops <- read_excel("00_Data_Files/bikeshops.xlsx") orderlines <- read_excel("00_Data_Files/orderlines.xlsx")
bikes bikeshops orderlines
bikes_data_joined <- orderlines %>% left_join(bikes, by=c("product.id" = "bike.id")) %>% left_join(bikeshops, by=c("customer.id" = "bikeshop.id")) bikes_data_joined %>% glimpse()
df <- bikes_data_joined %>%
## Split Description into 3 columns separate(description, c("category.1", "category.2", "frame.material"), " - ") %>% ## Split location into 2 columns separate(location, c("city", "state"), ", ") %>% glimpse()
Python: Split Strings Based on Delimiter
# 1.0 Load Libraries ---- # # Load Libraries # Core Python Data Analysis from numpy.core.defchararray import index import pandas as pd import numpy as np
bikes_df = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")
bikes_df = pd.read_excel("00_data_raw/bikes.xlsx")
bikeshops_df = pd.read_excel("00_data_raw/bikeshops.xlsx")
orderlines_df = pd.read_excel("00_data_raw/orderlines.xlsx")
bike_orderlines_joined_df = orderlines_df \ .drop(columns='Unnamed: 0', axis=1) \ .merge( right = bikes_df, how='left', left_on='product.id', right_on='bike.id' ) \ .merge( right=bikeshops_df, how = 'left', left_on='customer.id', right_on='bikeshop.id' )
bike_orderlines_joined_df
bike_orderlines_joined_df.info()
df = bike_orderlines_joined_df df.T
temp_df = df['description'].str.split(pat=' - ', expand = True) df['category.1'] = temp_df[0] df['category.2'] = temp_df[1] df['frame.material'] = temp_df[2] df.T
temp_df = df['location'].str.split(', ', n = 1, expand = True) df['city'] = temp_df[0] df['state'] = temp_df[1] df.T
Conclusion
The process of splitting strings between R and Python was again very similar. However, I do think that the R code was much more elegant compared to the Python code. In the next section, we will calculate and add new columns to our tables.
For the complete list of R v Python topics, click on the links below.