Transforming text-based or labeled categorical columns into machine-readable numeric representations using pandas utilities such as get_dummies(), astype('category').cat.codes, and factorize().
Convert Categorical Variables to Numeric in Pandas
Learn why and how to turn text or label columns into numbers in pandas using one-hot encoding, ordinal encoding, and binary labeling—plus best practices, mistakes to avoid, and production tips.
Most real-world data sets contain categorical variables—columns that describe qualities rather than quantities, such as color = "red"
or user_type = "admin"
. Machine-learning algorithms, analytic aggregates, and relational databases generally demand numeric input. Converting categorical variables to numeric values is therefore a cornerstone of data cleaning and feature engineering in Python's pandas
library.
int8
or int16
.pd.get_dummies()
)Creates a separate binary column for every category. Works well for nominal (unordered) categories with limited cardinality.
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
astype('category').cat.codes
Assigns an integer code (0,1,2,…) to each distinct value. Best for ordinal features (e.g., small < medium < large
) or when model accepts arbitrary integer labels (e.g., LightGBM).
df["size_code"] = df["size"].astype("category").cat.codes
pd.factorize()
factorize()
returns two objects: an array of codes and an Index of unique values. Useful for fast, repeated conversions.
codes, uniques = pd.factorize(df["user_type"])
df["user_type_code"] = codes
map
or replace
)When categories have known semantics, you can supply your own mapping dictionary.
grade_map = {"A":4, "B":3, "C":2, "D":1, "F":0}
df["gpa_points"] = df["grade"].map(grade_map)
import pandas as pd
raw = pd.DataFrame({
"color": ["red","green","blue","green"],
"size": ["S","M","L","S"],
"price": [10.99, 12.99, 15.49, 9.99]
})
# 1️⃣ One-hot encode color
one_hot = pd.get_dummies(raw["color"], prefix="color")
# 2️⃣ Ordinal encode size because it has logical order
size_order = {"S":1, "M":2, "L":3}
raw["size_code"] = raw["size"].map(size_order)
# 3️⃣ Concatenate features
features = pd.concat([one_hot, raw[["size_code", "price"]]], axis=1)
print(features)
The resulting features
DataFrame is now purely numeric and ready for modeling or loading into a data warehouse.
High-cardinality columns (e.g., 10,000 ZIP codes) can bloat one-hot encodings. Apply hashing tricks, frequency thresholding, or keep as category
dtype instead.
Store the cat.categories
or mapping dictionaries to assure consistent encoding between training and production pipelines.
After encoding, downcast to int8
or int16
to save memory:
df["user_type_code"] = df["user_type_code"].astype("int8")
In streaming or real-time scoring, new categories may appear. Build a fallback (code = -1
) or update the mapping regularly.
Wrap transformations in sklearn
ColumnTransformer
or Pipeline
objects for reproducibility.
False. Tree-based ensembles handle integer labels well and may perform better with ordinal encoding.
astype('category')
only sets dtype; you still need .cat.codes
to get numeric values.
factorize()
uses first appearance, not sort order. If order matters, define it explicitly.
In ETL pipelines, convert early—ideally before persisting to Parquet or a data warehouse. Numeric columns compress better and avoid costly string comparisons in SQL engines. After loading, BI tools and SQL editors like Galaxy run aggregations faster on integers than strings.
Transforming categorical variables into numeric form is a foundational skill for data engineers, analysts, and scientists. Whether you need one-hot vectors for linear models or compact int codes for dashboards, pandas provides flexible, performant options to suit every use case.
Almost every machine-learning model and most SQL engines require numeric input. Converting categorical columns to numbers ensures algorithm compatibility, minimizes storage costs, and speeds up analytical queries. In the modern data stack, this step bridges raw data collection and production-grade features consumed by models or dashboards.
Use one-hot for nominal categories without inherent order and limited cardinality. Opt for ordinal or integer codes when the categories are ordered or when your model (e.g., decision trees) can handle integers efficiently.
Persist the mapping: save cat.categories
or the dictionary you used. For scikit-learn, pickle the fitted encoder objects, or export them using joblib
.
Set unknowns to a reserved code (e.g., -1) or periodically retrain and redistribute the encoder to include the new categories.
Yes. Numeric columns execute aggregations and joins faster in databases. When you query those tables in Galaxy, the editor benefits from reduced latency and lower memory consumption compared with string columns.