Convert Categorical Variables to Numeric in Pandas

Galaxy Glossary

How do I convert categorical variables to numeric values in pandas?

Transforming text-based or labeled categorical columns into machine-readable numeric representations using pandas utilities such as get_dummies(), astype('category').cat.codes, and factorize().

Sign up for the latest in SQL knowledge from the Galaxy Team!
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Description

Convert Categorical Variables to Numeric in Pandas

Learn why and how to turn text or label columns into numbers in pandas using one-hot encoding, ordinal encoding, and binary labeling—plus best practices, mistakes to avoid, and production tips.

Introduction

Most real-world data sets contain categorical variables—columns that describe qualities rather than quantities, such as color = "red" or user_type = "admin". Machine-learning algorithms, analytic aggregates, and relational databases generally demand numeric input. Converting categorical variables to numeric values is therefore a cornerstone of data cleaning and feature engineering in Python's pandas library.

Why Convert Categorical Data?

  • Model Compatibility: Algorithms like linear regression and XGBoost accept only numbers.
  • Performance: Numeric dtype reduces memory footprint when stored as int8 or int16.
  • SQL & BI Tools: Data warehouses often favor integers over strings for joins and aggregations, resulting in faster queries.
  • Consistency: Encoded categories avoid typos (e.g., "NY" vs "N.Y.") by mapping to the same code.

Core Techniques

1. One-Hot Encoding (pd.get_dummies())

Creates a separate binary column for every category. Works well for nominal (unordered) categories with limited cardinality.

encoded = pd.get_dummies(df, columns=["color"], prefix="color")

  • Pros: No implicit ordering; suitable for tree-based models.
  • Cons: Explodes dimensionality for high-cardinality columns.

2. Ordinal Encoding With astype('category').cat.codes

Assigns an integer code (0,1,2,…) to each distinct value. Best for ordinal features (e.g., small < medium < large) or when model accepts arbitrary integer labels (e.g., LightGBM).

df["size_code"] = df["size"].astype("category").cat.codes

3. pd.factorize()

factorize() returns two objects: an array of codes and an Index of unique values. Useful for fast, repeated conversions.

codes, uniques = pd.factorize(df["user_type"])
df["user_type_code"] = codes

4. Custom Mapping (map or replace)

When categories have known semantics, you can supply your own mapping dictionary.

grade_map = {"A":4, "B":3, "C":2, "D":1, "F":0}
df["gpa_points"] = df["grade"].map(grade_map)

End-to-End Example

import pandas as pd

raw = pd.DataFrame({
"color": ["red","green","blue","green"],
"size": ["S","M","L","S"],
"price": [10.99, 12.99, 15.49, 9.99]
})

# 1️⃣ One-hot encode color
one_hot = pd.get_dummies(raw["color"], prefix="color")

# 2️⃣ Ordinal encode size because it has logical order
size_order = {"S":1, "M":2, "L":3}
raw["size_code"] = raw["size"].map(size_order)

# 3️⃣ Concatenate features
features = pd.concat([one_hot, raw[["size_code", "price"]]], axis=1)
print(features)

The resulting features DataFrame is now purely numeric and ready for modeling or loading into a data warehouse.

Best Practices

Profile Cardinality

High-cardinality columns (e.g., 10,000 ZIP codes) can bloat one-hot encodings. Apply hashing tricks, frequency thresholding, or keep as category dtype instead.

Persist Category Mappings

Store the cat.categories or mapping dictionaries to assure consistent encoding between training and production pipelines.

Use the Smallest Integer Type

After encoding, downcast to int8 or int16 to save memory:

df["user_type_code"] = df["user_type_code"].astype("int8")

Handle Unknown Categories

In streaming or real-time scoring, new categories may appear. Build a fallback (code = -1) or update the mapping regularly.

Pipeline Integration

Wrap transformations in sklearn ColumnTransformer or Pipeline objects for reproducibility.

Common Misconceptions

"One-hot is always better"

False. Tree-based ensembles handle integer labels well and may perform better with ordinal encoding.

"astype('category') automatically makes numbers"

astype('category') only sets dtype; you still need .cat.codes to get numeric values.

"Mapping preserves alphabetical order"

factorize() uses first appearance, not sort order. If order matters, define it explicitly.

Putting It in Production

In ETL pipelines, convert early—ideally before persisting to Parquet or a data warehouse. Numeric columns compress better and avoid costly string comparisons in SQL engines. After loading, BI tools and SQL editors like Galaxy run aggregations faster on integers than strings.

Conclusion

Transforming categorical variables into numeric form is a foundational skill for data engineers, analysts, and scientists. Whether you need one-hot vectors for linear models or compact int codes for dashboards, pandas provides flexible, performant options to suit every use case.

Why Convert Categorical Variables to Numeric in Pandas is important

Almost every machine-learning model and most SQL engines require numeric input. Converting categorical columns to numbers ensures algorithm compatibility, minimizes storage costs, and speeds up analytical queries. In the modern data stack, this step bridges raw data collection and production-grade features consumed by models or dashboards.

Convert Categorical Variables to Numeric in Pandas Example Usage


df["device_code"] = df["device_type"].astype("category").cat.codes

Convert Categorical Variables to Numeric in Pandas Syntax



Common Mistakes

Frequently Asked Questions (FAQs)

When should I use one-hot encoding vs. ordinal encoding?

Use one-hot for nominal categories without inherent order and limited cardinality. Opt for ordinal or integer codes when the categories are ordered or when your model (e.g., decision trees) can handle integers efficiently.

How do I ensure the same encoding in production?

Persist the mapping: save cat.categories or the dictionary you used. For scikit-learn, pickle the fitted encoder objects, or export them using joblib.

What happens if new categories appear during inference?

Set unknowns to a reserved code (e.g., -1) or periodically retrain and redistribute the encoder to include the new categories.

Is converting categories beneficial for SQL editors like Galaxy?

Yes. Numeric columns execute aggregations and joins faster in databases. When you query those tables in Galaxy, the editor benefits from reduced latency and lower memory consumption compared with string columns.

Want to learn about other SQL terms?