In R’s tidyr package, separate() splits a character column into multiple columns based on position or delimiter, whereas extract() parses a column into multiple columns using a regular-expression capture group, preserving original types.
Splitting a single text field into multiple, well-typed columns is a common data-wrangling step. In R, the tidyr
verbs separate()
and extract()
tackle this job, but they do so in subtly different ways. Choosing the right function can save you lines of code, prevent parsing errors, and keep your pipelines readable.
separate()
and extract()
?separate()
tokenizes a character column at a specified delimiter or fixed position and returns the chunks as new columns. extract()
applies a regular expression (regex) to a character column, captures specific substrings with parentheses, and assigns those captures to new columns.
extract()
can accurately isolate variable-length substrings, reducing the risk of misaligned splits when delimiters are inconsistent.separate()
is faster and more declarative.extract()
is better for retaining leading zeros or specific string formats when convert = FALSE
.separate(data, col, into, sep = "[delim]", remove = TRUE, convert = FALSE, ...)
extract(data, col, into, regex = "(pattern)", remove = TRUE, convert = FALSE, ...)
separate()
Workscol
).sep
(regex or numeric positions).into
.convert = TRUE
.extract()
Worksregex
to col
.into
."2023-12-31"
into year, month, day
; splitting GPS coordinates "42.3, -71.0"
."ID:AB-12345-X"
).library(tidyr)
library(dplyr)
orders <- tibble(
order_id = 1001:1003,
line = c("Widget_A|3|$10.99", "Gadget_B|12|$2.50", "Sprocket_C|1|$199.95")
)
# Using separate(): delimiter is "|"
orders_sep <- orders %>%
separate(line, into = c("product", "qty", "price"), sep = "\\|", convert = TRUE)
# Using extract(): remove leading $ from price with regex capture
orders_ext <- orders %>%
extract(line, into = c("product", "qty", "price"),
regex = "([^|]+)\\|([^|]+)\\|\\$([^|]+)", convert = TRUE)
The separate()
call is shorter, but extract()
avoids an extra gsub()
to drop the dollar sign.
stringr::str_extract()
FirstWrite and test your regex outside the pipeline to ensure it behaves as expected before committing it to extract()
.
into
Explicit destination names document intent and prevent mysterious ...1
, ...2
columns.
Both verbs return NA
for unmatched patterns. Wrap with tidyr::replace_na()
or dplyr::coalesce()
when defaults matter.
Leave convert = FALSE
when strings like ZIP codes ("02115"
) must retain leading zeros.
extra = "merge"/"drop"
fields, which can silently discard or combine data.Although tidyr
is an R library, the ideas carry over to SQL data cleaning carried out in Galaxy’s SQL editor. The mental model of choosing delimiter-based splits versus regex extraction mirrors SQL functions like SPLIT_PART()
or REGEXP_SUBSTR()
. Understanding these trade-offs helps developers translate R prototypes into optimized SQL with Galaxy’s AI copilot.
separate()
for straightforward, delimiter-driven splits.extract()
when delimiters vary or you need regex finesse.separate()
patterns to string split functions and extract()
patterns to regex functions.Data rarely arrives in perfectly normalized form. Dates, codes, and multi-value fields are often packed into single strings that must be split before analysis. Knowing when to apply delimiter-based splitting versus regex-based extraction accelerates cleaning, avoids edge-case errors, and yields reproducible pipelines that translate cleanly to production SQL environments such as Galaxy.
Yes. With separate()
you can supply a single regex like "[;|,]"
that matches semicolon, pipe, or comma. For complex logic, use extract()
.
Set remove = FALSE
. Both separate() and extract() leave the source column untouched when this flag is used.
Regex introduces overhead, but for data sets under a few million rows the difference is marginal. Profile your pipeline rather than assuming.
Galaxy’s AI copilot can suggest SQL equivalents—such as REGEXP_SUBSTR() or SPLIT_PART()—when you paste your R code or describe the transformation in chat.