In today’s how-to blog post, we will learn how to split any column, no matter the delimiter, in just two step. To follow along, download the dataset here.
The goal: splitting a column with three different delimiters
In my New York Times Best Sellers List dataset, I would like to split the Author column so that each cells includes only one name. For instance, if James Patterson and Michael Ledwidge wrote “The Wife between us” together:
I’ll want to produce one line for James and one for Michael:
The challenge: word delimiters
Unfortunately, our Author column is not evenly formatted: in some columns author names are divided by commas, in others by commas and the “and” conjugation, in others by the preposition “with”.
Since we need to create a row for each author-book combination, word delimiters such as “and” and “with” are particularly annoying here:
- they cannot be used as delimiters in the Text to Column Tool;
- they can be used as delimiters the RegEx Tool, but the RegEx Tool does not have a split to row option.
A tedious and lengthy solution
Your first instinct might be that of using several parsing tools, at least of for each different delimiter. If so, you are not alone!
When first tackling this challenge, I jumped into building the long workflow with this mindset. However, due to the challenges with word delimiters such as “and” and “with” mentioned above, this turned out to be a tedious job.
In practice, this meant that:
- I could split the column to rows when the delimiter was a comma…
- …but I had to use the RegEx Tool for “and” and “with”. Since the RegEx tool does not have a split to row option, I had to create one column per name. Sometimes, the names would be 2, others 4, creating semi-empty columns.
- After that, I needed to use a Transpose Tool
- and finally to get rid of the rows where the author name was null.
And throughout all this, I kept using Select Tools to rename the newly created columns in order not to lose sight of which column was which.
By the end of it, I had a headache!
A short and sweet solution
After sleeping over it, it dawned on me: I can actually split this column in just two steps (plus a Cleansing Tool)!
STEP 1: The Replace() function is very useful in these situations.
Just pick one of the three delimiters, and replace the other two with it. You need to pick a delimiter that can be used in the Text to Column tool. In our case, we can only pick the comma, since “and” and “with” cannot be used as delimiters in the text to column tool. If none of the delimiter in your column can be used in the Text to Column too, substitute all of them with one that is (e.g. commas, dashes).
Extra tip: Make sure to add a space around “with” and “and”, as in the example above. Otherwise, author names that contain these sequences of letters (e.g. Candance) will also be altered.
STEP 2: Drag in a Text to Column Tool, pick the comma as a delimiter and select Split to rows.
Extra tip: It is advisable to add a cleansing tool to trim out leading and trailing white spaces that might be included in your cell in the process.