Applying one or more transformations to several columns of an R data frame in one step using dplyr’s mutate() + across() pattern.
Mutating several columns at once is a common data-wrangling task. Since dplyr 1.0, the canonical way to do it is to combine mutate() with across(). The pattern dramatically reduces boilerplate, prevents copy-paste errors, and keeps your data pipelines readable, testable and fast.
Data rarely arrives in the shape you need for analysis. Numeric features may require scaling, date columns need standardisation, categorical variables should be recoded, and text fields often need trimming. Performing these tasks column by column quickly becomes error-prone and hard to maintain. The mutate(across()) idiom solves the problem by letting you apply the same logic to many columns at once, without copy-pasted code or explicit for loops.
mutate() adds or modifies columns. across() specifies which columns to touch and what function(s) to apply. The general template is:
data %>%
  mutate(across(.cols = <tidy-selection>, .fns = <function(s)>, .names = "{col}_new"))
Key arguments:
• .cols: a tidy-selection of the columns to transform (e.g., starts_with("score_"), where(is.numeric)).
• .fns: what to apply, either • a single function (e.g., log), • an anonymous function (e.g., ~ .x / 100), or • a named list of functions.
• .names: a glue pattern for the output names, e.g., "{col}" to keep the original name or "{col}_{fn}" for multi-function cases.
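For instance, all three arguments can be named explicitly (the data frame df and its score_ columns are assumed here for illustration):
df %>%
  mutate(across(
    .cols  = starts_with("score_"),              # which columns
    .fns   = list(log = log, pct = ~ .x / 100),  # what to apply
    .names = "{col}_{fn}"                        # how to name the results
  ))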
starts_with("prefix")
, ends_with()
, contains()
matches("regex")
where(is.numeric)
, where(is.character)
col1:col5
-c(id, date)
Mix them with |
and &
for complex rules.
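A quick sketch of combined selectors (the id columns are hypothetical): fill missing values in every numeric column except identifiers:
df %>%
  mutate(across(where(is.numeric) & !starts_with("id"), ~ coalesce(.x, 0)))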
library(dplyr)
data_scaled <- data_raw %>%
  mutate(across(where(is.numeric), scale))
Every numeric column is centred and scaled. Note that scale() returns a one-column matrix, so each transformed column becomes a matrix column; wrap the call if you want plain numeric vectors.
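One common workaround (a small sketch) is to coerce the result back to a numeric vector:
data_scaled <- data_raw %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))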
To apply several functions at once, pass a named list:
data_aug <- data_raw %>%
  mutate(across(starts_with("score_"),
                list(z = scale, pct = ~ (.x / max(.x)) * 100),
                .names = "{col}_{fn}"))
The list() names (z, pct) automatically suffix the new columns via {fn}.
Create multiple across() calls inside one mutate():
data_clean <- data_raw %>%
  mutate(across(where(is.character), stringr::str_trim),
         across(ends_with("_date"), ~ as.Date(.x, format = "%Y-%m-%d")))
Sometimes the new value of a column depends on values from other columns. Two approaches:
1. Use rowwise() to work in row context:
data_ratios <- data_raw %>%
  rowwise() %>%
  mutate(across(c(a, b, c), ~ .x / sum(c_across(c(a, b, c))))) %>%
  ungroup()
2. Compute a helper column first, which stays vectorised and is usually faster:
data_ratios <- data_raw %>%
  mutate(total = a + b + c) %>%
  mutate(across(a:c, ~ .x / total))
You can build function lists programmatically. Suppose you have a named vector of exponents:
powers <- c(sq = 2, cube = 3, quad = 4)
# One function per exponent; the names (sq, cube, quad) flow into {fn}
funs <- purrr::imap(powers, function(p, nm) function(x) x ^ p)
data_pow <- data_raw %>%
  mutate(across(where(is.numeric), funs, .names = "{col}_{fn}"))
A few performance notes:
• across() is C++-optimised when the applied function is vectorised.
• Prefer vectorised helpers (e.g., stringi for text).
• Under group_by(), operations run per group.
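For instance, per-group standardisation (region, sales and returns are hypothetical columns) is just a grouped mutate:
df %>%
  group_by(region) %>%
  mutate(across(c(sales, returns), ~ (.x - mean(.x)) / sd(.x))) %>%
  ungroup()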
Best practices:
• Prefer across() over the superseded mutate_at()/mutate_if().
• Use .names proactively to avoid overwriting or to create tidy column names.
• Use separate across() calls instead of nesting long anonymous functions.
Common pitfalls:
• mutate_at(), mutate_if(), and mutate_all() may be removed in future releases; migrate to across().
• Forgetting .names when you intend to keep the originals will silently replace data.
• Slow, non-vectorised functions inside across() (e.g., lm(), web requests) will murder performance. Pre-vectorise or move to rowwise() judiciously.
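To illustrate the overwrite pitfall with a hypothetical data frame df:
# Overwrites every numeric column in place
df %>% mutate(across(where(is.numeric), log))
# Keeps the originals and adds log-transformed copies
df %>% mutate(across(where(is.numeric), log, .names = "{col}_log"))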
Putting it all together, here is a typical feature-engineering pipeline:
features <- raw %>%
  # Step 1: group by customer
  group_by(customer_id) %>%
  # Step 2: compute behaviour counts
  summarise(across(event:audit, sum),
            first_seen = min(timestamp),
            last_seen = max(timestamp)) %>%
  ungroup() %>%
  # Step 3: scale numeric counts and keep dates intact
  mutate(across(where(is.integer), ~ (.x - min(.x)) / (max(.x) - min(.x)),
                .names = "{col}_norm"))
The result is a clean feature table ready for modelling.
Although dplyr is an R-side data-manipulation library, many teams use it to prototype transformations that later get re-implemented in SQL. Galaxy’s AI copilot can translate a well-structured dplyr pipeline—especially one built with across()—into efficient SQL, letting engineers validate logic in R before committing it to the warehouse.
Mastering mutate(across()) unlocks expressive, concise, and high-performance data wrangling. Adopt tidy-selection helpers, name your outputs clearly, and keep functions vectorised. Your future self—and your collaborators—will thank you.
In real-world datasets you often have dozens or hundreds of variables that require identical cleaning steps. Writing one line per column is not only verbose but also risky—typos and inconsistent logic creep in easily. The mutate-across idiom gives data engineers and analysts a single, reproducible source of truth, thereby improving code quality, reducing maintenance cost, and accelerating experimentation.
mutate_at() was the historical way to apply a function to selected columns. It is now superseded by mutate(across()), which offers a unified syntax, more powerful tidy-selection, better naming control via .names, and a clearer mental model.
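A minimal before/after sketch of that migration:
# Superseded
data_raw %>% mutate_at(vars(starts_with("score_")), scale)
# Current idiom
data_raw %>% mutate(across(starts_with("score_"), scale))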
Supply a .names pattern that creates new column names, e.g., .names = "{col}_scaled". Without it, dplyr will overwrite the originals.
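For example:
data_raw %>%
  mutate(across(where(is.numeric), scale, .names = "{col}_scaled"))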
Why is a mutate(across()) call slow? Most likely the function you pass to across() is not vectorised or calls R-level loops. Rewrite it with vectorised functions, or apply rowwise() only when necessary.
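As an illustration (the cap of 100 is made up), an element-by-element loop can usually be replaced by a single vectorised call:
# Slow: R-level loop over every element
data_raw %>% mutate(across(where(is.numeric), ~ sapply(.x, function(v) min(v, 100))))
# Fast: one vectorised call
data_raw %>% mutate(across(where(is.numeric), ~ pmin(.x, 100)))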
Can Galaxy help translate dplyr pipelines into SQL? Yes. Galaxy’s AI copilot understands tidyverse syntax and can suggest equivalent SQL, allowing you to prototype in R and productionise in the data warehouse.