Transforming string-based or categorical columns in a pandas DataFrame into numerical representations—such as label encoding or one-hot encoding—so that analytical and machine-learning algorithms can process them.
In pandas, converting a categorical variable to numeric means representing non-numerical labels (strings, objects, or pandas’ own category dtype) as numbers. This is essential because most analytical and machine-learning algorithms expect numerical input.
Numeric encoding of categorical features underpins tasks such as regression, classification, clustering, feature selection, visualization, and storage optimization. Concretely, numeric types enable fast vectorized computation and shrink memory footprints when stored as category codes or int8/int16.
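As a rough illustration (a minimal sketch with made-up data), converting a repetitive string column to the category dtype can cut memory by an order of magnitude or more:
import pandas as pd
# Hypothetical example: one million rows drawn from three labels
s = pd.Series(["red", "green", "blue"] * 333_334)
print(s.memory_usage(deep=True))                     # object dtype: tens of MB
print(s.astype("category").memory_usage(deep=True))  # category dtype: roughly 1 MB (int8 codes)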
With label encoding, each unique category receives an integer code.
import pandas as pd
df = pd.DataFrame({
"gender": ["female", "male", "female", "male"]
})
# Method A – mapping dictionary
mapping = {"female": 0, "male": 1}
df["gender_code"] = df["gender"].map(mapping)
# Method B – pandas categorical codes
df["gender_cat"] = df["gender"].astype("category").cat.codes
When to use: ordinal relationships exist (e.g., low < medium < high) or the model handles arbitrary integers (tree-based methods). Avoid with linear models unless the variable is truly ordinal.
One-hot encoding creates a binary column per category.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
dummies = pd.get_dummies(colors, columns=["color"], prefix="color", dtype="int8")
When to use: Nominal categories without inherent order; widely compatible with linear models. Be mindful of the “curse of dimensionality” when cardinality is high.
Ordinal encoding explicitly defines ordered categories.
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})
order = ["S", "M", "L", "XL"]
sizes["size_ord"] = pd.Categorical(sizes["size"], categories=order, ordered=True).codes
This ensures that the numeric codes respect business logic (e.g., S < M < L < XL).
Advanced schemes (via category_encoders or sklearn.preprocessing) compress high-cardinality features or use target statistics. Apply them with cross-validation to avoid leakage.
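Below is a minimal sketch of out-of-fold target encoding; the city and churn columns are invented for illustration, and libraries such as category_encoders offer a ready-made TargetEncoder if you prefer not to roll your own:
import pandas as pd
from sklearn.model_selection import KFold
# Hypothetical data: a high-cardinality "city" feature and a binary target
df = pd.DataFrame({
    "city": ["NY", "LA", "SF", "NY", "LA", "SF", "NY", "LA"],
    "churn": [1, 0, 1, 1, 0, 0, 1, 1],
})
# Out-of-fold encoding: each row's code is computed from the other folds
# only, so the row's own target value never leaks into its feature.
df["city_te"] = 0.0
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    means = df.iloc[train_idx].groupby("city")["churn"].mean()
    fallback = df["churn"].iloc[train_idx].mean()  # for cities unseen in this fold
    df.loc[df.index[val_idx], "city_te"] = (
        df["city"].iloc[val_idx].map(means).fillna(fallback)
    )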
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# 1️⃣ Load data
df = pd.read_csv("customer_churn.csv")
# 2️⃣ Identify categorical columns
cat_cols = df.select_dtypes(include=["object", "category"]).columns
# 3️⃣ Encode (one-hot for nominal, ordinal for ordered)
ord_map = {"Bronze": 0, "Silver": 1, "Gold": 2}
df["membership_level"] = df["membership_level"].map(ord_map)
df = pd.get_dummies(df, columns=[col for col in cat_cols if col != "membership_level"], drop_first=True)
# 4️⃣ Train/test split
X_train, X_test, y_train, y_test = train_test_split(
df.drop("churn", axis=1), df["churn"], test_size=0.2, random_state=42
)
# 5️⃣ Fit model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# 6️⃣ Evaluate
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Downcast dummy columns to int8 to save memory. For reusable production pipelines, prefer sklearn transformers (OneHotEncoder, OrdinalEncoder) or ColumnTransformer.
Deleting the categorical column before confirming the encoding worked leads to lost data. Fix: validate the encoded columns first, or keep the original until the end of the pipeline.
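As a sketch (reusing the gender example from earlier), a simple guard before dropping the source column might look like:
# map() silently produces NaN for any label missing from the mapping,
# so check for unmapped values before discarding the original column.
unmapped = df.loc[df["gender_code"].isna(), "gender"].unique()
assert len(unmapped) == 0, f"Unmapped labels: {unmapped}"
df = df.drop(columns=["gender"])  # safe to drop only after the check passes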
Skipping drop_first=True with one-hot encoding: leaving all dummy columns in can introduce perfect multicollinearity for regression models. Fix: use drop_first=True or patsy-style reference coding.
Applying integer codes (0, 1, 2, …) to nominal categories may trick linear models into inferring an order. Fix: one-hot encode, or explicitly assign an unordered dtype.
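Both fixes, sketched on the colors frame from the one-hot example above:
# Fix 1: one-hot encode so no ordering is implied
color_dummies = pd.get_dummies(colors["color"], prefix="color")
# Fix 2: keep a single column but make the lack of order explicit
colors["color"] = pd.Categorical(colors["color"], ordered=False)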
Convert columns to pd.Categorical early to shrink memory usage. Prefer vectorized operations over row-wise apply. Use sparse dtypes or external libraries (e.g., cuDF) for millions of rows.
Sometimes it is cheaper to encode in SQL (e.g., CASE WHEN color='red' THEN 1 ELSE 0 END). If your workflow lives primarily in a SQL editor like Galaxy, push simple binary or ordinal transformations to the database to reduce data transfer volume. However, advanced encodings (target, hashing) are usually easier in pandas.
Numeric encoding of categorical variables in pandas is a foundational step in any data-driven project. Mastering the right method—label, one-hot, ordinal or more advanced encoders—ensures model compatibility, mitigates pitfalls, and keeps your pipelines efficient.
Machine-learning and statistical algorithms require numbers, not strings. Encoding categorical data numerically enables model training, boosts performance, saves memory, and prevents analytic errors. Without proper encoding, models may crash, yield biased coefficients, or silently misinterpret categorical relationships.
Call .astype('category').cat.codes. It converts the column to pandas’ categorical dtype and exposes pre-computed integer codes, which is typically vectorized and memory-efficient.
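For example (a sketch reusing the gender column from earlier), the code-to-label mapping can also be recovered from the categorical dtype itself:
cat = df["gender"].astype("category")
df["gender_code"] = cat.cat.codes
# Recover the mapping for later decoding or documentation
code_to_label = dict(enumerate(cat.cat.categories))  # e.g. {0: 'female', 1: 'male'}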
If the categorical feature is nominal (no intrinsic order) and you’re using linear or distance-based models, choose one-hot encoding. Tree-based models can handle label encoding because splits are inequality-based and do not rely on magnitude.
For regression or GLM models that include an intercept, yes: set drop_first=True in pd.get_dummies to avoid the dummy variable trap. For tree-based or regularized models, it’s optional.
Use sklearn transformers (e.g., OneHotEncoder, OrdinalEncoder) trained on your training data, serialize the pipeline with joblib, and load the identical mapping during inference to prevent mismatched columns.
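A minimal sketch of that workflow (the column names and file name are made up for illustration):
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# Hypothetical training frame with one categorical and one numeric column
X_train = pd.DataFrame({"color": ["red", "green", "blue"], "amount": [1.0, 2.0, 3.0]})
y_train = [0, 1, 0]
pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
        remainder="passthrough",  # pass numeric columns through unchanged
    )),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
joblib.dump(pipe, "churn_pipeline.joblib")   # serialize once, after training
pipe = joblib.load("churn_pipeline.joblib")  # identical mapping at inference time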