Convert Categorical Variables to Numeric in pandas

Galaxy Glossary

How do I convert a categorical variable to numeric in pandas?

Transforming string-based or categorical columns in a pandas DataFrame into numerical representations—such as label encoding or one-hot encoding—so that analytical and machine-learning algorithms can process them.


Description

Definition

In pandas, converting a categorical variable to numeric means representing non-numerical labels (strings, objects, or pandas’ own category dtype) as numbers. This is essential because most analytical and machine-learning algorithms expect numerical input.

Why It Matters

Numeric encoding of categorical features underpins tasks such as regression, classification, clustering, feature selection, visualization, and storage optimization. Concretely, numeric types:

  • Enable mathematical operations and distance metrics
  • Reduce memory footprint when using category codes or int8/int16
  • Serve as a prerequisite for libraries like scikit-learn, XGBoost, LightGBM, and statsmodels
  • Allow quick aggregation and grouping with vectorized operations

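The memory point above is easy to verify directly. A minimal sketch (the column values are made up for illustration) comparing a plain object column with its category equivalent:

```python
import pandas as pd

# A hypothetical low-cardinality string column repeated many times.
s = pd.Series(["red", "green", "blue"] * 100_000)

object_bytes = s.memory_usage(deep=True)
category_bytes = s.astype("category").memory_usage(deep=True)

# The category version stores small integer codes plus three strings,
# instead of 300,000 separate Python string objects.
print(object_bytes, category_bytes)
```

On typical builds the category column uses an order of magnitude less memory.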
Core Techniques

1. Label (Integer) Encoding

Each unique category receives an integer code.

import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"]
})

# Method A – mapping dictionary
mapping = {"female": 0, "male": 1}
df["gender_code"] = df["gender"].map(mapping)

# Method B – pandas categorical codes
df["gender_cat"] = df["gender"].astype("category").cat.codes

When to use: when an ordinal relationship exists (e.g., low < medium < high), or with models that handle arbitrary integers (tree-based methods). Avoid with linear models unless the categories are genuinely ordinal, since the model will interpret the codes as magnitudes.

2. One-Hot Encoding (Dummy Variables)

Creates a binary column per category.

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

dummies = pd.get_dummies(colors, columns=["color"], prefix="color", dtype="int8")

When to use: Nominal categories without inherent order; widely compatible with linear models. Be mindful of the “curse of dimensionality” when cardinality is high.
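One common way to tame high cardinality before one-hot encoding is to lump rare labels together. A sketch, using made-up city names and an illustrative frequency threshold:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "Boise", "NYC", "LA", "Reno"]})

# Keep only categories seen at least `min_count` times; lump the rest into "Other".
min_count = 2
counts = df["city"].value_counts()
keep = counts[counts >= min_count].index
df["city_grouped"] = df["city"].where(df["city"].isin(keep), other="Other")

# One-hot encoding now produces 3 columns instead of 5.
dummies = pd.get_dummies(df["city_grouped"], prefix="city", dtype="int8")
print(dummies.columns.tolist())
```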

3. Ordinal Encoding

Explicitly define ordered categories.

sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})
order = ["S", "M", "L", "XL"]

sizes["size_ord"] = pd.Categorical(sizes["size"], categories=order, ordered=True).codes

Ensures that the numeric codes respect business logic (e.g., S < M < L < XL).

4. Binary Encoding, Hashing, and Target Encoding

Advanced schemes (via category_encoders or sklearn.preprocessing) compress high-cardinality features or use target statistics. Apply with cross-validation to avoid leakage.
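To illustrate why cross-validation matters here, the following sketch implements target encoding by hand with out-of-fold statistics, using only pandas and scikit-learn (the data and column names are hypothetical; the category_encoders library offers a packaged equivalent):

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data: a hypothetical category column and a binary target.
df = pd.DataFrame({
    "city":  ["NYC", "LA", "NYC", "SF", "LA", "NYC", "SF", "LA"],
    "churn": [1, 0, 1, 0, 1, 0, 0, 1],
})

global_mean = df["churn"].mean()
df["city_te"] = global_mean  # fallback for categories unseen in a fold

# Out-of-fold target encoding: each row's code comes from the *other*
# folds, so a row's own target never leaks into its feature.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby("city")["churn"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(fold_means).fillna(global_mean).values
    )

print(df[["city", "city_te"]])
```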

End-to-End Workflow

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1️⃣ Load data
df = pd.read_csv("customer_churn.csv")

# 2️⃣ Identify categorical columns
cat_cols = df.select_dtypes(include=["object", "category"]).columns

# 3️⃣ Encode (one-hot for nominal, ordinal for ordered)
ord_map = {"Bronze": 0, "Silver": 1, "Gold": 2}
df["membership_level"] = df["membership_level"].map(ord_map)

df = pd.get_dummies(df, columns=[col for col in cat_cols if col != "membership_level"], drop_first=True)

# 4️⃣ Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("churn", axis=1), df["churn"], test_size=0.2, random_state=42
)

# 5️⃣ Fit model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 6️⃣ Evaluate
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

Best Practices

  • Choose encoding by model type: tree-based ⟶ label or target; linear ⟶ one-hot.
  • Control cardinality: group rare labels into “Other” to limit dummy columns.
  • Preserve dtypes: cast dummy columns to int8 to save memory.
  • Keep pipelines reproducible: use sklearn transformers (OneHotEncoder, OrdinalEncoder) or ColumnTransformer.
  • Avoid data leakage: fit encoders on the training set only; transform test/production sets with same mapping.

Common Mistakes & Fixes

Mistake 1: Dropping the Original Column Prematurely

Deleting the categorical column before confirming the encoding worked leads to lost information. Fix: validate the encoded column first, or keep the original column until the end of the pipeline.
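A quick validation step catches this: with .map, any label missing from the mapping silently becomes NaN, so check for unmapped values before dropping the original (the incomplete mapping below is deliberate):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["female", "male", "nonbinary"]})
mapping = {"female": 0, "male": 1}  # incomplete on purpose

df["gender_code"] = df["gender"].map(mapping)

# Validate before dropping the original: unmapped labels surface as NaN.
unmapped = df.loc[df["gender_code"].isna(), "gender"].unique()
if len(unmapped) > 0:
    print("Unmapped labels:", unmapped)  # fix the mapping before df.drop(...)
```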

Mistake 2: Forgetting drop_first=True with One-Hot Encoding

Leaving all dummy columns in place can introduce perfect multicollinearity (the dummy variable trap) in regression models. Fix: use drop_first=True or patsy-style reference coding.

Mistake 3: Implicitly Assuming Ordinal Meaning

Applying integer codes (0, 1, 2, …) to nominal categories can trick linear and distance-based models into inferring an order that does not exist. Fix: one-hot encode, or explicitly use an unordered categorical dtype.

Performance Tips

  • For big data, convert string columns with .astype("category") early to shrink memory usage.
  • Vectorize mappings with .map or .replace; avoid row-wise Python .apply.
  • Consider pandas sparse dtypes or external libraries (e.g., cuDF) for millions of rows.

When to Transform Inside the Database

Sometimes it is cheaper to encode in SQL (e.g., CASE WHEN color='red' THEN 1 ELSE 0 END). If your workflow lives primarily in a SQL editor like Galaxy, push simple binary or ordinal transformations to the database to reduce data transfer volume. However, advanced encodings (target, hashing) are usually easier in pandas.

Conclusion

Numeric encoding of categorical variables in pandas is a foundational step in any data-driven project. Mastering the right method—label, one-hot, ordinal or more advanced encoders—ensures model compatibility, mitigates pitfalls, and keeps your pipelines efficient.

Why Converting Categorical Variables to Numeric in pandas Is Important

Machine-learning and statistical algorithms require numbers, not strings. Encoding categorical data numerically enables model training, boosts performance, saves memory, and prevents analytic errors. Without proper encoding, models may crash, yield biased coefficients, or silently misinterpret categorical relationships.

Frequently Asked Questions (FAQs)

What’s the fastest way to get numeric codes for a column?

Call .astype('category').cat.codes. It converts the column to pandas’ categorical dtype and exposes pre-computed integer codes, which is typically vectorized and memory-efficient.

When should I prefer one-hot over label encoding?

If the categorical feature is nominal (no intrinsic order) and you’re using linear or distance-based models, choose one-hot encoding. Tree-based models can handle label encoding because splits are inequality-based and do not rely on magnitude.

Do I need to drop the first dummy column?

For regression or GLM models that include an intercept, yes—set drop_first=True in pd.get_dummies to avoid the dummy variable trap. For tree-based or regularized models, it’s optional.

How do I keep the same encoding in production?

Use sklearn transformers (e.g., OneHotEncoder, OrdinalEncoder) trained on your training data, serialize the pipeline with joblib, and load the identical mapping during inference to prevent mismatched columns.
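A minimal sketch of that pattern, combining OneHotEncoder and OrdinalEncoder in a ColumnTransformer and serializing the whole pipeline with joblib (the column names, category order, and file path are illustrative):

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical training frame.
train = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "size":  ["S", "L", "M", "XL"],
    "churn": [0, 1, 0, 1],
})

pre = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("ordinal", OrdinalEncoder(categories=[["S", "M", "L", "XL"]]), ["size"]),
])

pipe = Pipeline([("pre", pre), ("model", LogisticRegression())])
pipe.fit(train[["color", "size"]], train["churn"])

joblib.dump(pipe, "churn_pipeline.joblib")    # serialize for production
loaded = joblib.load("churn_pipeline.joblib")  # identical mapping at inference
pred = loaded.predict(pd.DataFrame({"color": ["green"], "size": ["M"]}))
print(pred)
```

Because the fitted encoders travel inside the pipeline, inference always sees exactly the columns the model was trained on.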
